Abstract: Dictionary learning algorithms have been successfully
used for both reconstructive and discriminative tasks,
where an input signal is represented with a sparse linear
combination of dictionary atoms. While these methods are
mostly developed for single-modality scenarios, recent studies
have demonstrated the advantages of feature-level fusion
based on the joint sparse representation of the multimodal
inputs. In this paper, we propose a multimodal task-driven
dictionary learning algorithm under the joint sparsity constraint
(prior) to enforce collaborations among multiple
homogeneous/heterogeneous sources of information. In this
task-driven formulation, the multimodal dictionaries are learned
simultaneously with their corresponding classifiers. The resulting
multimodal dictionaries can generate discriminative
latent features (sparse codes) from the data that are optimized
for a given task such as binary or multiclass classification.
Moreover, we present an extension of the proposed
formulation using a mixed joint and independent sparsity
prior which facilitates more flexible fusion of the modalities
at feature level. The efficacy of the proposed algorithms
for multimodal classification is illustrated on four different
applications: multimodal face recognition, multi-view face
recognition, multi-view action recognition, and multimodal
biometric recognition. It is also shown that, compared to
the counterpart reconstructive-based dictionary learning algorithms,
the task-driven formulations are more computationally
efficient in the sense that they can be equipped with more
compact dictionaries and still achieve superior performance.

Index Terms: Dictionary learning, Multimodal classification,
Sparse representation, Feature fusion


I.  Introduction 

It is well established that information fusion using multiple
sensors can generally result in improved recognition
performance [1]. It provides a framework to combine
local information from different perspectives which is more
tolerant to the errors of individual sources [2], [3]. Fusion
methods for classification are generally categorized into
feature fusion [4] and classifier fusion [5] algorithms.
Feature fusion methods aggregate extracted features from
different sources into a single feature set which is then used
for classification. On the other hand, classifier fusion algorithms
combine decisions from individual classifiers, each
of which is trained using separate sources. While classifier
fusion is a well-studied topic, fewer studies have been done
for feature fusion, mainly due to the incompatibility of the
feature sets [6]. A naive way of feature fusion is to stack
the features into a longer one [7]. However, this approach
usually suffers from the curse of dimensionality due to the
limited number of training samples [4]. Even in scenarios
with abundant training samples, concatenation of feature
vectors does not take into account the relationship among
the different sources and it may contain noisy or redundant
data, which degrade the performance of the classifier [6].
However, if these limitations are mitigated, feature fusion
can potentially result in improved classification performance
[8], [9].

Sparse representation classification, in which the input signal
is approximated with a linear combination of a few
dictionary atoms [10], has recently attracted the interest of
many researchers and has been successfully applied
to several problems such as robust face recognition [10],
visual tracking [11], and transient acoustic signal classification
[12]. In this approach, a structured dictionary is
usually constructed by stacking all the training samples
from the different classes. The method has also been
expanded for efficient feature-level fusion which is usually
referred to as multi-task learning [13], [14], [15], [16].
Among different proposed sparsity constraints (priors), joint
sparse representation has shown significant performance
improvement in several multi-task learning applications
such as target classification, biometric recognition, and
multi-view face recognition [12], [14], [17], [18]. The underlying
assumption is that the multimodal test input can
be simultaneously represented by a few dictionary atoms,
or training samples, from a multimodal dictionary that
represents all the modalities and, therefore, the resulting
sparse coefficients should have the same sparsity pattern.
However, a dictionary constructed by collecting
the training samples suffers from two limitations. First, as
the number of training samples increases, the resulting
optimization problem becomes more computationally demanding.
Second, a dictionary constructed this way
is optimal neither for reconstructive tasks [19] nor
for discriminative tasks [20].

Recently it has been shown that learning the dictionary
can overcome the above limitations and significantly improve
the performance in several applications including
image restoration [21], face recognition [22] and object
recognition [23], [24]. The learned dictionaries are usually
more compact and have fewer dictionary atoms than
the number of training samples [25], [26]. Dictionary
learning algorithms can generally be categorized into two
groups: unsupervised and supervised. Unsupervised dictionary
learning algorithms such as the method of optimal
directions [27] and K-SVD [25] are aimed at finding a
dictionary that yields minimum errors when adapted to
reconstruction tasks such as signal denoising [28] and image
inpainting [19]. Although unsupervised dictionary
learning has also been used for classification [22], it has
been shown that better performance can be achieved by
learning dictionaries that are adapted to a specific
task rather than just the data set [29], [30]. These methods
are called supervised, or task-driven, dictionary learning
algorithms. For the classification task, for example, it is
more meaningful to utilize the labeled data to minimize
the misclassification error rather than the reconstruction
error [31]. Adding a discriminative term to the reconstruction
error and minimizing a trade-off between them
has been proposed in several formulations [20], [24],
[32], [33]. The incoherent dictionary learning algorithm
proposed in [34] is another supervised formulation which
trains class-specific dictionaries to minimize atom sharing
between different classes and uses sparse representation
for classification. In [35], a Fisher criterion is proposed to
learn structured dictionaries such that the sparse coefficients
have small within-class and large between-class scatters.
While unsupervised dictionary learning can be reformulated
as a large scale matrix factorization problem and solved
efficiently [19], supervised dictionary learning is usually
more difficult to optimize. More recently, it has been shown
that better optimization tools can be used to tackle
supervised dictionary learning [30], [36]. This is achieved
by formulating it as a bilevel optimization problem [37],
[38]. In particular, a stochastic gradient descent algorithm
has been proposed in [29] which efficiently solves the
dictionary learning problem in a unified framework for
different tasks, such as classification, nonlinear image mapping,
and compressive sensing.

The majority of the existing dictionary learning algorithms,
including the task-driven dictionary learning [29],
are only applicable to a single source of data. In [39], a set
of view-specific dictionaries and a common dictionary are
learned for the application of multi-view action recognition.
The view-specific dictionaries are trained to exploit
view-level correspondence while the common dictionary is
trained to capture common patterns shared among the different
views. The formulation proposed in [39] belongs to the class
of dictionary learning algorithms that leverage the labeled
samples to learn class-specific atoms while minimizing
the reconstruction error. Moreover, it cannot be used for
fusion of heterogeneous modalities. In [40], a generative
multimodal dictionary learning algorithm is proposed to
extract typical templates of multimodal features. The templates
represent synchronous transient structures between
modalities which can be used for localization applications.
More recently, a multimodal dictionary learning algorithm
with joint sparsity prior is proposed in [41] for multimodal
retrieval where the task is to find relevant samples from
other modalities for a given unimodal query. However,
the formulation proposed in [41] cannot be readily applied for
information fusion in which the task is to find the label of a
given multimodal query. Moreover, the joint sparsity prior
is used in [41] to couple similarly labeled samples within
each modality and is not utilized to extract cross-modality
information which is essential for information fusion [12].
Furthermore, the dictionaries in [41] are learned to be
generative by minimizing the reconstruction error of data
across modalities and, therefore, are not necessarily optimal
for discriminative tasks [31].

Fig. 1: Multimodal task-driven dictionary learning scheme.

This  paper  focuses  on  learning  discriminative  multimodal 
dictionaries.  The  major  contributions  of  the  paper  are  as 
follows: 

•  Formulation of the multimodal dictionary learning
algorithms: A multimodal task-driven dictionary learning
algorithm is proposed for classification using homogeneous
or heterogeneous sources of information.
Information from different modalities is fused both
at the feature level, by using the joint sparse representation,
and at the decision level, by combining the
scores of the modal-based classifiers. The proposed
formulation simultaneously trains the multimodal dictionaries
and classifiers under the joint sparsity prior in
order to enforce collaborations among the modalities
and obtain the latent sparse codes as the optimized
features for different tasks such as binary and multiclass
classification. Fig. 1 presents an overview of
the proposed framework. An unsupervised multimodal
dictionary learning algorithm is also presented as a
by-product of the supervised version.

•  Differentiability of the bi-level optimization problem:
The main difficulty in proposing such a formulation
is that the solution of the corresponding joint sparse
coding problem is not differentiable with respect to
the dictionaries. While the joint sparse coding has
a non-smooth cost function, it is shown here that
it is locally differentiable and the resulting bi-level
optimization for task-driven multimodal dictionary
learning is smooth and can be solved using a stochastic
gradient descent algorithm (see footnote 1).

•  Flexible feature-level fusion: An extension of the proposed
framework is presented which facilitates more
flexible fusion of the modalities at the feature level by
allowing the modalities to have different sparsity patterns.
This extension provides a framework to tune the
trade-off between independent sparse representation
and joint sparse representation among the modalities.

•  Improved performance for multimodal classification:
The proposed methods achieve state-of-the-art
performance in a range of different multimodal classification
tasks. In particular, we provide an extensive
performance comparison between the proposed
algorithms and some of the competing methods from
the literature for four different tasks: multimodal face
recognition, multi-view face recognition, multimodal
biometric recognition, and multi-view action recognition.
The experimental results on these datasets
demonstrate the usefulness of the proposed formulation,
showing that the proposed algorithm can be readily
applied to several different application domains.

•  Improved efficiency for sparse-representation-based
classification: It is shown here that, compared to the
counterpart sparse representation classification algorithms,
the proposed algorithms are more computationally
efficient in the sense that they can be equipped
with more compact dictionaries and still achieve superior
performance.

A.  Paper  organization 

The rest of the paper is organized as follows. In Section II,
unsupervised and supervised dictionary learning
algorithms for a single source of information are reviewed.
Joint sparse representation for multimodal classification is
also reviewed in this section. Section III proposes the
task-driven multimodal dictionary learning algorithms. Comparative
studies on several benchmarks and concluding remarks
are presented in Section IV and Section V, respectively.

B.  Notation 

Vectors are denoted by bold lower case letters and
matrices by bold upper case letters. For a given vector $x$,
$x_i$ is its $i$th element. For a given finite set of indices $\gamma$,
$x_\gamma$ is the vector formed with those elements of $x$ indexed
in $\gamma$. The symbol $\to$ is used to distinguish the row vectors
from column vectors, i.e. for a given matrix $X$, the $i$th row
and $j$th column of the matrix are represented as $x_{i\to}$ and $x_j$,
respectively. For a given finite set of indices $\gamma$, $X_\gamma$ is the
matrix formed with those columns of $X$ indexed in $\gamma$ and
$X_{\gamma\to}$ is the matrix formed with those rows of $X$ indexed
in $\gamma$. Similarly, for given finite sets of indices $\gamma$ and $\psi$,
$X_{\gamma\to,\psi}$ is the matrix formed with those rows and columns
of $X$ indexed in $\gamma$ and $\psi$, respectively. $x_{ij}$ is the element
of $X$ at row $i$ and column $j$. The $\ell_q$ norm, $q \geq 1$, of a
vector $x \in \mathbb{R}^m$ is defined as
$\|x\|_{\ell_q} = (\sum_{j=1}^{m} |x_j|^q)^{1/q}$.
The Frobenius norm and $\ell_{1q}$ norm, $q \geq 1$, of a matrix
$X \in \mathbb{R}^{m \times n}$ are defined as
$\|X\|_F = (\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2)^{1/2}$
and $\|X\|_{\ell_{1q}} = \sum_{i=1}^{m} \|x_{i\to}\|_{\ell_q}$, respectively. The collection
$\{x^i \mid i \in \gamma\}$ is shortly denoted as $\{x^i\}$.

1 The source code of the proposed algorithm is released here: https:
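
The norms defined in this subsection can be checked numerically. The following is a minimal NumPy sketch; the function names `lq_norm`, `l1q_norm`, and `frobenius_norm` and the example matrix are illustrative assumptions and are not taken from the released source code.

```python
import numpy as np

def lq_norm(x, q):
    """l_q norm of a vector, for q >= 1."""
    return np.sum(np.abs(x) ** q) ** (1.0 / q)

def l1q_norm(X, q):
    """l_{1q} norm of a matrix: the sum of the l_q norms of its rows."""
    return sum(lq_norm(row, q) for row in X)

def frobenius_norm(X):
    """Square root of the sum of squared entries."""
    return np.sqrt(np.sum(X ** 2))

# Hypothetical 3 x 2 matrix with one all-zero row.
X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
l12 = l1q_norm(X, 2)       # row norms are 5, 0 and 1
fro = frobenius_norm(X)    # sqrt(9 + 16 + 1)
```

With $q = 2$, an all-zero row contributes nothing to the $\ell_{12}$ norm, which is why penalizing this norm favors matrices with few non-zero rows.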

II.  Background 
A.  Dictionary  learning 

Dictionary learning has been widely used in various tasks
such as reconstruction, classification, and compressive sensing
[29], [33], [42], [43]. In contrast to principal component
analysis (PCA) and its variants, dictionary learning algorithms
generally do not impose an orthogonality condition and
are more flexible, allowing them to be well-tuned to the training
data. Let $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{n \times N}$ be the collection
of $N$ (normalized) training samples that are assumed to be
statistically independent. The dictionary $D \in \mathbb{R}^{n \times d}$ can then
be obtained as the minimizer of the following empirical
cost [22]:

$$g_N(D) = \frac{1}{N} \sum_{i=1}^{N} l_u(x_i, D) \qquad (1)$$

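over the regularizing convex set
$\mathcal{D} = \{D \in \mathbb{R}^{n \times d} \mid \|d_k\|_{\ell_2} \leq 1, \forall k = 1, \ldots, d\}$,
where $d_k$ is the $k$th column, or atom, in the dictionary and the
unsupervised loss $l_u$ is defined as

$$l_u(x, D) = \min_{\alpha \in \mathbb{R}^d} \|x - D\alpha\|_{\ell_2}^2 + \lambda_1 \|\alpha\|_{\ell_1} + \lambda_2 \|\alpha\|_{\ell_2}^2, \qquad (2)$$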

which is the optimal value of the sparse coding problem
with $\lambda_1$ and $\lambda_2$ being the regularizing parameters. While
$\lambda_2$ is usually set to zero to exploit sparsity, using $\lambda_2 > 0$
makes the optimization problem in Eq. (2) strongly convex
resulting in a differentiable cost function [29]. The index $u$
of $l_u$ is used to emphasize that the above dictionary learning
formulation is an unsupervised method. It is well-known
that one is often interested in minimizing an expected
risk, rather than the perfect minimization of the empirical
cost [44]. An efficient online algorithm is proposed in [19]
to find the dictionary $D$ as the minimizer of the following
stochastic cost over the convex set $\mathcal{D}$:

$$g(D) = \mathbb{E}_x [l_u(x, D)], \qquad (3)$$

where it is assumed that the data $x$ is drawn from a finite
probability distribution $p(x)$ which is usually unknown
and $\mathbb{E}_x[\cdot]$ is the expectation operator with respect to the
distribution $p(x)$.
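
To make the sparse coding step of Eq. (2) concrete, the following is a minimal sketch that assumes a plain proximal-gradient (ISTA) solver. This solver is a standard choice swapped in for illustration and is not necessarily the one used by the authors; the names `sparse_code_ista`, `soft_threshold`, and `objective`, the random dictionary, and the parameter values are all hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_1 (entry-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(x, D, alpha, lam1, lam2):
    """Cost inside the min of Eq. (2) for a candidate code alpha."""
    r = x - D @ alpha
    return r @ r + lam1 * np.sum(np.abs(alpha)) + lam2 * (alpha @ alpha)

def sparse_code_ista(x, D, lam1, lam2, n_iter=500):
    """Approximate the minimizer of Eq. (2) by proximal gradient steps."""
    # Lipschitz constant of the gradient of the smooth part.
    L = 2.0 * (np.linalg.norm(D, 2) ** 2 + lam2)
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ alpha - x) + 2.0 * lam2 * alpha
        alpha = soft_threshold(alpha - grad / L, lam1 / L)
    return alpha

# Hypothetical data: a random dictionary with unit-norm atoms and a
# signal built from two of its atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
x = 1.5 * D[:, 3] - 0.8 * D[:, 17]
alpha = sparse_code_ista(x, D, lam1=0.1, lam2=0.01)
```

The value `objective(x, D, alpha, lam1, lam2)` then approximates $l_u(x, D)$; the approximation improves with more iterations.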

The trained dictionary can then be used to (sparsely)
reconstruct the input. The reconstruction error has been
shown to be a robust measure for classification tasks [10],
[45]. Another use of a given trained dictionary is for feature
extraction where the sparse code $\alpha^\star(x, D)$, obtained as a
solution of (2), is used as a feature vector representing the
input signal $x$ in the classical expected risk optimization
for training a classifier [29]:

$$\min_{w \in \mathcal{W}} \mathbb{E}_{y,x} [l(y, w, \alpha^\star(x, D))] + \frac{\nu}{2} \|w\|_{\ell_2}^2, \qquad (4)$$
where $y$ is the ground truth class label associated with
the input $x$, $w$ denotes the model (classifier) parameters, $\nu$ is a
regularizing parameter, and $l$ is a convex loss function that
measures how well one can predict $y$ given the feature
vector $\alpha^\star$ and classifier parameters $w$. The expectation
$\mathbb{E}_{y,x}$ is taken with respect to the probability distribution
$p(y, x)$ of the labeled data. Note that in Eq. (4), the dictionary
$D$ is fixed and independent of the given task and class label
$y$. In task-driven dictionary learning, on the other hand,
a supervised formulation is used which finds the optimal
dictionary and classifier parameters jointly by solving the
following optimization problem [29]:

$$\min_{D \in \mathcal{D}, w \in \mathcal{W}} \mathbb{E}_{y,x} [l_{su}(y, w, \alpha^\star(x, D))] + \frac{\nu}{2} \|w\|_{\ell_2}^2. \qquad (5)$$

The  index  su  of  convex  loss  function  lsu  is  used  to 
emphasize  that  the  above  dictionary  learning  formulation 
is  supervised.  The  learned  task-driven  dictionary  has  been 
shown  to  result  in  a  superior  performance  compared  to  the 
unsupervised  setting  [29].  In  this  setting,  the  sparse  codes 
are  indeed  the  optimized  latent  features  for  the  classifier. 

B.  Multimodal  joint  sparse  representation 

Joint sparse representation provides an efficient tool for
feature-level fusion of sources of information [12], [14],
[46]. Let $\mathcal{S} = \{1, \ldots, S\}$ be a finite set of available
modalities and let $x^s \in \mathbb{R}^{n^s}$, $s \in \mathcal{S}$, be the feature
vector for the $s$th modality. Also let $D^s \in \mathbb{R}^{n^s \times d}$ be
the corresponding dictionary for the $s$th modality. For
now, it is assumed that the multimodal dictionaries are
constructed by collections of the training samples from
different modalities, i.e. the $j$th atom of dictionary $D^s$ is
the $j$th training sample from the $s$th modality. Given a
multimodal input $\{x^s \mid s \in \mathcal{S}\}$, shortly denoted as $\{x^s\}$, an
optimal sparse matrix $A^\star \in \mathbb{R}^{d \times S}$ is obtained by solving
the following $\ell_{12}$-regularized reconstruction problem:

$$\operatorname*{argmin}_{A = [\alpha^1 \ldots \alpha^S]} \frac{1}{2} \sum_{s=1}^{S} \|x^s - D^s \alpha^s\|_{\ell_2}^2 + \lambda \|A\|_{\ell_{12}}, \qquad (6)$$

where $\lambda$ is a regularization parameter. Here $\alpha^s$ is the $s$th
column of $A$ which corresponds to the sparse representation
for the $s$th modality. Different algorithms have been
proposed to solve the above optimization problem [47],
[48]. We use the efficient alternating direction method
of multipliers (ADMM) [49] to find $A^\star$. The $\ell_{12}$ prior
encourages row sparsity in $A^\star$, i.e. it encourages collaboration
among all the modalities by enforcing that the same
dictionary atoms from different modalities, which represent the
same event, are used for reconstructing the inputs $\{x^s\}$.
An $\ell_{11}$ term can also be added to the above cost function
to extend it to a more general framework where sparsity
can also be sought within the rows, as will be discussed
in Section III-D. It has been shown that joint sparse
representation can result in superior performance in fusing
multimodal sources of information compared to other information
fusion techniques [45]. We are interested in learning
multimodal dictionaries under the joint sparsity prior. This
has several advantages over a fixed dictionary consisting of
training data. Most importantly, it can potentially remove
the redundant and noisy information by representing the
training data in a more compact form. Also, using the
supervised formulation, one expects to find dictionaries that
are well-adapted to the discriminative tasks.
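
The row-sparsity effect of the $\ell_{12}$ prior in Eq. (6) can be illustrated with a short sketch. The text above uses ADMM [49]; the code below instead assumes a simpler proximal-gradient iteration with row-wise shrinkage, chosen only because it is compact. The function names, the random dictionaries, and the value of `lam` are hypothetical.

```python
import numpy as np

def row_soft_threshold(A, tau):
    """Proximal operator of tau * ||A||_{l12}: shrink each row toward zero."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * A

def joint_objective(xs, Ds, A, lam):
    """Cost of Eq. (6) for a candidate coefficient matrix A (d x S)."""
    fit = sum(0.5 * np.sum((x - D @ A[:, s]) ** 2)
              for s, (x, D) in enumerate(zip(xs, Ds)))
    return fit + lam * np.sum(np.linalg.norm(A, axis=1))

def joint_sparse_code(xs, Ds, lam, n_iter=300):
    """Approximate A* of Eq. (6); xs[s] has length n^s, Ds[s] is n^s x d."""
    d, S = Ds[0].shape[1], len(Ds)
    # Lipschitz constant of the gradient of the quadratic data-fit term.
    L = max(np.linalg.norm(D, 2) ** 2 for D in Ds)
    A = np.zeros((d, S))
    for _ in range(n_iter):
        G = np.column_stack([D.T @ (D @ A[:, s] - x)
                             for s, (x, D) in enumerate(zip(xs, Ds))])
        A = row_soft_threshold(A - G / L, lam / L)
    return A

# Two hypothetical modalities of different dimension whose inputs are
# built from the same two atom indices (4 and 11).
rng = np.random.default_rng(0)
Ds = []
for n_s in (15, 25):
    D = rng.standard_normal((n_s, 30))
    Ds.append(D / np.linalg.norm(D, axis=0))
xs = [D[:, 4] - 0.5 * D[:, 11] for D in Ds]
A_star = joint_sparse_code(xs, Ds, lam=0.1)
```

Because the shrinkage acts on whole rows, an atom index is either discarded for every modality or kept for all of them, which is the shared sparsity pattern described above.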

III.  Multimodal  dictionary  learning 

In  this  section,  online  algorithms  for  unsupervised  and 
supervised  multimodal  dictionary  learning  are  proposed. 


A.  Multimodal  unsupervised  dictionary  learning 

Unsupervised multimodal dictionary learning is derived
by extending the optimization problem characterized in
Eq. (3) and using the joint sparse representation of (6) to
enforce collaborations among modalities. Let the minimum
cost $l'_u(\{x^s, D^s\})$ of the joint sparse coding be defined as

$$\min_{A} \frac{1}{2} \sum_{s=1}^{S} \|x^s - D^s \alpha^s\|_{\ell_2}^2 + \lambda_1 \|A\|_{\ell_{12}} + \frac{\lambda_2}{2} \|A\|_F^2, \qquad (7)$$

where $\lambda_1$ and $\lambda_2$ are the regularizing parameters. The additional
Frobenius norm $\|\cdot\|_F$ compared to Eq. (6) guarantees
a unique solution for the joint sparse optimization problem.
In the special case when $S = 1$, optimization (7) reduces
to the well-studied elastic-net optimization [50]. By natural
extension of the optimization problem (3), the unsupervised
multimodal dictionaries are obtained by:

$$D^{s\star} = \operatorname*{argmin}_{D^s \in \mathcal{D}^s} \mathbb{E}_{x^s} [l'_u(\{x^s, D^s\})], \quad \forall s \in \mathcal{S}, \qquad (8)$$

where the convex set $\mathcal{D}^s$ is defined as

$$\mathcal{D}^s \triangleq \{D \in \mathbb{R}^{n^s \times d} \mid \|d_k\|_{\ell_2} \leq 1, \forall k = 1, \ldots, d\}. \qquad (9)$$

It is assumed that data $x^s$ is drawn from a finite (unknown)
probability distribution $p(x^s)$. The above optimization
problem can be solved using the classical projected
stochastic gradient algorithm [51] which consists of a
sequence of updates as follows:

$$D^s \leftarrow \Pi_{\mathcal{D}^s} [D^s - \rho_t \nabla_{D^s} l'_u(\{x^s_t, D^s\})], \qquad (10)$$

where $\rho_t$ is the gradient step at time $t$ and $\Pi_{\mathcal{D}}$ is the
orthogonal projector onto set $\mathcal{D}$. The algorithm converges
to a stationary point for a decreasing sequence of $\rho_t$ [51],
[52]. A typical choice of $\rho_t$ is shown in the next section.
This problem can also be solved using an online matrix
factorization algorithm [26]. It should be noted that
while the stochastic gradient descent does converge, it is
not guaranteed to converge to a global minimum due to
the non-convexity of the optimization problem [26], [44].
However, such a stationary point is empirically found to be
sufficiently good for practical applications [21], [28].
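
A single update of Eq. (10) for one modality can be sketched as follows. The sketch assumes the standard result from the single-modality literature that, once the optimal code is fixed, the gradient of the optimal value with respect to $D^s$ equals $-(x^s - D^s \alpha^{s\star}) \alpha^{s\star T}$. The function names, the step size, and the placeholder code vector are hypothetical.

```python
import numpy as np

def project_unit_columns(D):
    """Orthogonal projection onto {D : ||d_k||_2 <= 1 for every column k}."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

def unsupervised_step(D, x, alpha, rho):
    """One projected stochastic gradient update for a single modality.

    alpha is assumed to be the column of A* that belongs to this modality,
    computed beforehand by a joint sparse coding solver.
    """
    grad = -np.outer(x - D @ alpha, alpha)
    return project_unit_columns(D - rho * grad)

# Hypothetical sample, dictionary and (placeholder) sparse code.
rng = np.random.default_rng(1)
D = project_unit_columns(rng.standard_normal((10, 6)))
x = rng.standard_normal(10)
alpha = np.array([0.0, 0.7, 0.0, 0.0, -0.3, 0.0])
D_new = unsupervised_step(D, x, alpha, rho=0.1)
```

Only the atoms with non-zero coefficients in `alpha` are moved by the gradient step; the projection then rescales any atom whose norm exceeds one.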

B. Multimodal task-driven dictionary learning

As discussed in Section II, the unsupervised setting does
not take into account the label of the training data, and
the dictionaries are obtained by minimizing the reconstruction
error. However, for classification tasks, the minimum
reconstruction error does not necessarily result in discriminative
dictionaries. In this section, a multimodal task-driven
dictionary learning algorithm is proposed that enforces
collaboration among the modalities both at the feature level
using joint sparse representation and the decision level
using a sum of the decision scores. We propose to learn
the dictionaries $D^{s\star}, \forall s \in \mathcal{S}$, and the classifier parameters
$w^{s\star}, \forall s \in \mathcal{S}$, shortly denoted as the set $\{D^{s\star}, w^{s\star}\}$, jointly
as the solution of the following optimization problem:

$$\min_{\{D^s \in \mathcal{D}^s, w^s \in \mathcal{W}^s\}} f(\{D^s, w^s\}) + \frac{\nu}{2} \sum_{s=1}^{S} \|w^s\|_{\ell_2}^2, \qquad (11)$$

where $f$ is defined as the expected cumulative cost:

$$f(\{D^s, w^s\}) = \mathbb{E} \sum_{s=1}^{S} l_{su}(y, w^s, \alpha^{s\star}), \qquad (12)$$

where $\alpha^{s\star}$ is the $s$th column of the minimizer
$A^\star(\{x^s, D^s\})$ of the optimization problem (7) and
$l_{su}(y, w, \alpha)$ is a convex loss function that measures how
well the classifier parametrized by $w$ can predict $y$ by
observing $\alpha$. The expectation is taken with respect to the
joint probability distribution of the multimodal inputs $\{x^s\}$
and label $y$. Note that $\alpha^{s\star}$ acts as a hidden/latent feature
vector, corresponding to the input $x^s$, which is generated by
the learned discriminative dictionary $D^{s\star}$. In general, $l_{su}$
can be chosen as any convex function such that $l_{su}(y, \cdot, \cdot)$
is twice continuously differentiable for all possible values
of $y$. A few examples are given below for binary and
multiclass classification tasks.

1) Binary classification: In a binary classification task
where the label $y$ belongs to the set $\{-1, 1\}$, $l_{su}$ can be
naturally chosen as the logistic regression loss

$$l_{su}(y, w, \alpha^\star) = \log(1 + e^{-y w^T \alpha^\star}), \qquad (13)$$

where $w \in \mathbb{R}^d$ is the vector of classifier parameters. Once the
optimal $\{D^s, w^s\}$ are obtained, a new multimodal sample
$\{x^s\}$ is classified according to the sign of $\sum_{s=1}^{S} w^{sT} \alpha^{s\star}$ due
to the uniform monotonicity of $\sum_{s=1}^{S} l_{su}$. For simplicity,
the intercept term for the linear model is omitted here,
but it can be easily added. One can also use a bilinear
model where, instead of a set of vectors $\{w^s\}$, a set of
matrices $\{W^s\}$ is learned and a new multimodal sample
is classified according to the sign of $\sum_{s=1}^{S} x^{sT} W^s \alpha^{s\star}$.
Accordingly, the $\ell_2$-norm regularization of Eq. (11) needs
to be replaced with the matrix Frobenius norm. The bilinear
model is richer than the linear model and can sometimes
result in better classification performance but needs more
careful training to avoid over-fitting.
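
The logistic loss of Eq. (13) and the sum-of-scores decision rule for the linear model can be sketched as below. The classifier vectors and sparse codes are hypothetical placeholders standing in for learned $w^s$ and computed $\alpha^{s\star}$, and assigning a zero score to the class $+1$ is an arbitrary tie-breaking assumption.

```python
import numpy as np

def logistic_loss(y, w, alpha):
    """Eq. (13): log(1 + exp(-y * w^T alpha)), with y in {-1, +1}."""
    # logaddexp(0, z) evaluates log(1 + exp(z)) without overflow.
    return np.logaddexp(0.0, -y * (w @ alpha))

def classify_binary(ws, alphas):
    """Sign of the decision scores summed over the modalities."""
    score = sum(w @ a for w, a in zip(ws, alphas))
    return 1 if score >= 0 else -1

# Two hypothetical modalities whose codes share the same support.
ws = [np.array([1.0, -1.0, 0.0]), np.array([0.5, 0.5, -0.5])]
alphas = [np.array([0.0, 2.0, 0.0]), np.array([0.0, 1.0, 0.0])]
label = classify_binary(ws, alphas)   # scores: -2.0 + 0.5 = -1.5
```

Here the first modality votes strongly for the class $-1$ and the second weakly for $+1$, so the fused decision follows the more confident modality.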


2) Multiclass classification: Multiclass classification can
be formulated using a collection of (independently learned)
binary classifiers in a one-vs-one or one-vs-all setting.
Multiclass classification can also be handled in an all-vs-all
setting using the softmax regression loss function. In this
scheme, the label $y$ belongs to the set $\{1, \ldots, K\}$ and the
softmax regression loss is defined as

$$l_{su}(y, W, \alpha^\star) = -\sum_{k=1}^{K} 1_{\{y=k\}} \log\left(\frac{e^{w_k^T \alpha^\star}}{\sum_{l=1}^{K} e^{w_l^T \alpha^\star}}\right), \qquad (14)$$

where $W = [w_1 \ldots w_K] \in \mathbb{R}^{d \times K}$, and $1_{\{\cdot\}}$ is the indicator
function. Once the optimal $\{D^s, W^s\}$ are obtained, a new
multimodal sample $\{x^s\}$ is classified as

$$\operatorname*{argmax}_{k \in \{1, \ldots, K\}} \sum_{s=1}^{S} \frac{e^{w_k^{sT} \alpha^{s\star}}}{\sum_{l=1}^{K} e^{w_l^{sT} \alpha^{s\star}}}. \qquad (15)$$
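
The softmax loss of Eq. (14) and the fused decision rule that sums the per-modality class probabilities can be sketched as follows. The matrices `Ws` and codes `alphas` are hypothetical placeholders, and classes are numbered from 1 to $K$ as in the text.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_loss(y, W, alpha):
    """Eq. (14) for a label y in {1, ..., K} and W of size d x K."""
    p = softmax(W.T @ alpha)
    return -np.log(p[y - 1])

def classify_softmax(Ws, alphas):
    """Class with the largest sum of per-modality softmax probabilities."""
    total = sum(softmax(W.T @ a) for W, a in zip(Ws, alphas))
    return int(np.argmax(total)) + 1

# Two hypothetical modalities with d = 3 atoms and K = 3 classes.
Ws = [np.eye(3), np.eye(3)]
alphas = [np.array([0.0, 2.0, 0.0]), np.array([0.0, 0.5, 0.0])]
label = classify_softmax(Ws, alphas)
```

Summing probabilities rather than raw scores keeps each modality's contribution bounded, so a single modality with very large scores cannot dominate the decision arbitrarily.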

In yet another all-vs-all setting, the multiclass classification
task can be turned into a regression task in which the scalar
label $y$ is changed to a binary vector $y \in \mathbb{R}^K$, where the
$k$th coordinate corresponding to the label of $\{x^s\}$ is set to
one and the rest of the coordinates are set to zero. In this
setting, $l_{su}$ is defined as

$$l_{su}(y, W, \alpha^\star) = \frac{1}{2} \|y - W\alpha^\star\|_{\ell_2}^2, \qquad (16)$$

where $W \in \mathbb{R}^{K \times d}$. Having obtained the optimal
$\{D^s, W^s\}$, the test sample $\{x^s\}$ is then classified as

$$\operatorname*{argmin}_{k \in \{1, \ldots, K\}} \sum_{s=1}^{S} \|q_k - W^s \alpha^{s\star}\|_{\ell_2}^2, \qquad (17)$$

where $q_k$ is a binary vector in which its $k$th coordinate is
one and its remaining coordinates are zero.
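
The regression-style loss of Eq. (16) and the nearest-indicator decision rule of Eq. (17) can be sketched as follows. All numerical values are hypothetical placeholders for learned $W^s$ and computed $\alpha^{s\star}$; classes are numbered from 1 to $K$.

```python
import numpy as np

def one_hot(k, K):
    """Binary vector q_k whose k-th coordinate (1-based) is one."""
    q = np.zeros(K)
    q[k - 1] = 1.0
    return q

def quadratic_loss(y_vec, W, alpha):
    """Eq. (16): half the squared l2 distance between y and W alpha."""
    r = y_vec - W @ alpha
    return 0.5 * (r @ r)

def classify_regression(Ws, alphas, K):
    """Eq. (17): the class whose indicator vector is closest overall."""
    costs = [sum(np.sum((one_hot(k, K) - W @ a) ** 2)
                 for W, a in zip(Ws, alphas))
             for k in range(1, K + 1)]
    return int(np.argmin(costs)) + 1

# Two hypothetical modalities with K = 3 classes and d = 4 atoms.
K = 3
Ws = [np.eye(3, 4), np.eye(3, 4)]
alphas = [np.array([0.9, 0.1, 0.0, 0.0]), np.array([0.7, 0.2, 0.0, 0.0])]
label = classify_regression(Ws, alphas, K)
loss = quadratic_loss(one_hot(1, K), Ws[0], alphas[0])
```

In this toy case both modalities map their codes close to the first indicator vector, so the summed distance is smallest for class 1.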

In choosing between the one-vs-all setting, in which
independent multimodal dictionaries are trained for each
class, and the multiclass formulation, in which multimodal
dictionaries are shared between classes, a few points should
be considered. In the one-vs-all setting, the total number of
dictionary atoms is equal to $dSK$ in the $K$-class classification
while in the multiclass setting the number is equal
to $dS$. It should be noted that in the multiclass setting a
larger dictionary is generally required to achieve the same
level of performance to capture the variations among all
classes. However, it is generally observed that the size
of the dictionaries in the multiclass setting is not required to
grow linearly as the number of classes increases due to
atom sharing among the different classes. Another point to
consider is that the class-specific dictionaries of the
one-vs-all approach are independent and can be obtained in
parallel. In this paper, the multiclass formulation is used
to allow feature sharing among the classes.


C.  Optimization 

The main challenge in optimizing (11) is the non-differentiability of
$A^*(\{x^s, D^s\})$. However, it can be shown that although the sparse
coefficients $A^*$ are obtained by solving a non-differentiable
optimization problem, the function $f(\{D^s, w^s\})$, defined in
Eq. (12), is differentiable on
$\mathcal{D}^1 \times \cdots \times \mathcal{D}^S \times \mathcal{W}^1 \times \cdots \times \mathcal{W}^S$,
and therefore its gradients are computable. To find the gradient of $f$
with respect to $D^s$, one can find the optimality condition of the
optimization (7) or use the fixed point differentiation [36], [38] and
show that $A^*$ is differentiable over its non-zero rows. Without loss
of generality, we assume that the label $y$ admits a finite set of
values such as those defined in Eqs. (13) and (14). The same algorithm
can be derived for the scenario when $y$ belongs to a compact subset of
a finite-dimensional real vector space as in Eq. (16). A couple of mild
assumptions are required to prove the differentiability of $f$; they
are direct generalizations of those required for the single-modal
scenario [29] and are listed below:

Assumption (A). The multimodal data $(y, \{x^s\})$ admit a
probability density $p$ with compact support.

Assumption (B). For all possible values of $y$, $p(y, \cdot)$
is continuous and $l_{su}(y, \cdot)$ is twice continuously
differentiable.

The first assumption is reasonable when dealing with signal/image
processing applications where the values acquired by the sensors are
bounded. Also, all the given examples for $l_{su}$ in the previous
section satisfy the second assumption. Before stating the main
proposition of this paper below, the term active set is defined.

Definition 3.1 (Active set): The active set $\Lambda$ of the solution
$A^*$ of the joint sparse coding problem (7) is defined to be

$$\Lambda = \{ j \in \{1, \dots, d\} : \|a^*_{j\rightarrow}\|_{\ell_2} \neq 0 \}, \qquad (18)$$

where $a^*_{j\rightarrow}$ is the $j$th row of $A^*$.
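As a minimal sketch of Definition 3.1, the active set can be read off a computed code matrix by keeping the rows whose $\ell_2$ norm is non-zero. The function name `active_set`, the tolerance argument `tol`, and the toy matrix are illustrative assumptions and are not part of the paper.

```python
import math


def active_set(A, tol=0.0):
    """Return the 1-based indices of the rows of A whose l2 norm exceeds tol.

    A is a d x S matrix given as a list of rows (one column per modality).
    tol is a hypothetical numerical tolerance; tol=0.0 matches Eq. (18).
    """
    return [j for j, row in enumerate(A, start=1)
            if math.sqrt(sum(v * v for v in row)) > tol]


# Toy code matrix with d = 4 atoms and S = 3 modalities.
A_star = [[0.0, 0.0, 0.0],
          [0.5, -0.2, 0.1],
          [0.0, 0.0, 0.0],
          [0.0, 0.3, 0.0]]

print(active_set(A_star))  # prints [2, 4]
```

In floating-point practice a small positive `tol` is usually safer than an exact comparison with zero, since iterative solvers rarely return exact zeros.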

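Proposition 3.1 (Differentiability and gradients of $f$):
Let $\lambda_2 > 0$ and let assumptions (A) and (B) hold. Let
$\Upsilon = \cup_{j \in \Lambda} \Upsilon_j$, where
$\Upsilon_j = \{j, j + d, \dots, j + (S-1)d\}$.
Let the matrix $\hat{D} \in \mathbb{R}^{n \times |\Upsilon|}$ be defined as

$$\hat{D} = [\hat{D}_1, \dots, \hat{D}_{|\Lambda|}], \qquad (19)$$

where $\hat{D}_j = \mathrm{blkdiag}(d^1_j, \dots, d^S_j) \in \mathbb{R}^{n \times S}$,
$\forall j \in \Lambda$, is the collection of the $j$th active atoms of
the multimodal dictionaries, $d^s_j$ is the $j$th active atom of $D^s$,
blkdiag is the block diagonalization operator, and
$n = \sum_{s \in \mathcal{S}} n^s$. Also let the matrix
$\Delta \in \mathbb{R}^{|\Upsilon| \times |\Upsilon|}$ be defined as

$$\Delta = \mathrm{blkdiag}(\Delta_1, \dots, \Delta_{|\Lambda|}), \qquad (20)$$

where
$\Delta_j = \frac{1}{\|a^*_{j\rightarrow}\|_{\ell_2}} I - \frac{1}{\|a^*_{j\rightarrow}\|_{\ell_2}^3} a^{*T}_{j\rightarrow} a^*_{j\rightarrow} \in \mathbb{R}^{S \times S}$,
$\forall j \in \Lambda$, and $I$ is the identity matrix. Then, the
function $f$ defined in Eq. (12) is differentiable and, $\forall s \in \mathcal{S}$,

$$\nabla_{w^s} f = \mathbb{E}\left[\nabla_{w^s} l_{su}(y, w^s, \alpha^{s*})\right],$$

$$\nabla_{D^s} f = \mathbb{E}\left[(x^s - D^s \alpha^{s*}) \beta_{\tilde{s}}^T - D^s \beta_{\tilde{s}} \alpha^{s*T}\right], \qquad (21)$$

where $\tilde{s} = \{s, s + S, \dots, s + (d-1)S\}$ and
$\beta \in \mathbb{R}^{dS}$ is defined as

$$\beta_{\Upsilon^c} = 0, \quad \beta_{\Upsilon} = (\hat{D}^T \hat{D} + \lambda_1 \Delta + \lambda_2 I)^{-1} g, \qquad (22)$$

in which
$g = \mathrm{vec}\left(\nabla_{A^{*T}_{\Lambda\rightarrow}} \sum_{s=1}^{S} l_{su}(y, w^s, \alpha^{s*})\right)$,
$\Upsilon^c = \{1, \dots, dS\} \setminus \Upsilon$,
$\beta_{\Upsilon} \in \mathbb{R}^{|\Upsilon|}$ is formed of those rows
of $\beta$ indexed by $\Upsilon$, and vec(.) is the vectorization operator.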
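The block $\Delta_j$ of Eq. (20) can be illustrated numerically. The sketch below assumes that $\Delta_j$ is the Hessian of the $\ell_2$ norm evaluated at the active row $a^*_{j\rightarrow}$, i.e. $I/\|a\| - a^T a/\|a\|^3$. Under that assumption $\Delta_j$ is symmetric and maps the row itself to zero, which the example checks. The helper names `delta_block` and `matvec` are hypothetical.

```python
import math


def delta_block(a):
    """Return I/||a|| - a^T a / ||a||^3 for a non-zero row a of length S.

    Assumption: this is the Hessian of the l2 norm at a, used here as the
    S x S block Delta_j associated with one active row.
    """
    norm = math.sqrt(sum(v * v for v in a))
    if norm == 0.0:
        raise ValueError("Delta_j is only defined for active (non-zero) rows")
    S = len(a)
    return [[(1.0 if p == q else 0.0) / norm - a[p] * a[q] / norm ** 3
             for q in range(S)]
            for p in range(S)]


def matvec(M, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


a = [3.0, 4.0]            # one active row with S = 2 modalities
Delta = delta_block(a)    # [[0.128, -0.096], [-0.096, 0.072]] up to rounding
residual = matvec(Delta, a)
```

Because each block is singular along its own row, $\lambda_1 \Delta$ alone does not make the system in Eq. (22) invertible; the positive-definiteness argument in the text relies on the $\hat{D}^T \hat{D}$ and $\lambda_2 I$ terms.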


The proof of this proposition is given in the Appendix.
A stochastic gradient descent algorithm to find the optimal
dictionaries $\{D^{s*}\}$ and classifiers $\{w^{s*}\}$ is described in
Algorithm 1. The stochastic gradient descent algorithm is guaranteed to
converge under a few assumptions that are mildly stricter than those in
this paper (three-times differentiability is required) [53]. To further
improve the convergence of the proposed stochastic gradient descent
algorithm, a classic mini-batch strategy is used in which a small batch
of the training data, instead of a single sample, is drawn in each
iteration, and the parameters are updated using the averaged updates of
the batch. This has the additional advantage that $D^T D$ and the
corresponding factorization of the ADMM for solving the sparse coding
problem can be computed once for the whole batch. For the special case
when $S = 1$, the proposed algorithm reduces to the single-modal
task-driven dictionary learning algorithm in [29]. Selecting
$\lambda_2$ in Eq. (7) to be strictly positive guarantees that the
linear equations of (22) have a unique solution. In other words, it is
easy to show that the matrix
$(\hat{D}^T \hat{D} + \lambda_1 \Delta + \lambda_2 I)$ is positive
definite given $\lambda_1 > 0$, $\lambda_2 > 0$. However, in practice
it is observed that the solution of the joint sparse representation
problem is numerically stable, since $\hat{D}$ becomes full column rank
when sparsity is sought with a sufficiently large $\lambda_1$, and
$\lambda_2$ can be set to zero. It should be noted that the assumption
of $\hat{D}$ being a full column rank matrix is a common assumption in
sparse linear regression [26]. As in any non-convex optimization
algorithm, if the algorithm is not initialized properly, it may yield
poor performance. Similar to [29], the dictionaries $\{D^s\}$ are
initialized by the solution of the unsupervised multimodal dictionary
learning algorithm. Upon assignment of the initial dictionaries, the
parameters $\{w^s\}$ of the classifiers are set by solving (11) only
with respect to $\{w^s\}$, which is a convex optimization problem.


D.  Extension 

We now present an extension of the proposed algorithm with a more
flexible structure on the sparse codes. Joint sparse representation
relies on the fact that all the modalities share the same sparsity
pattern in which, if a multimodal training sample is selected to
reconstruct the input, then all the modalities within that training
sample are active. However, this group sparsity constraint, imposed by
the $\ell_{12}$ norm, may be too stringent for some applications [45],
[54], for example in scenarios where the modalities have different
noise levels or when the heterogeneity of the modalities imposes
different sparsity levels for the reconstruction task. A natural
relaxation of the joint sparsity prior is to let the multimodal inputs
not share the full active set, which can be achieved by replacing the
$\ell_{12}$ norm with a combination of the $\ell_{12}$ and $\ell_{11}$
norms ($\ell_{12} - \ell_{11}$ norm). Following the same formulation as
in Section III-B, let $A^*(\{x^s, D^s\})$ in Eq. (11) be the minimizer
of the

Algorithm 1 Stochastic gradient descent algorithm for multimodal
task-driven dictionary learning.

Input: Regularization parameters $\lambda_1$, $\lambda_2$, $\nu$, learning rate parameters
$\rho$, $t_0$, number of iterations $T$, initial dictionaries $\{D^s \in \mathcal{D}^s\}_{s \in \mathcal{S}}$,
initial model parameters $\{w^s \in \mathcal{W}^s\}_{s \in \mathcal{S}}$.

Output: Learned $\{D^s, w^s\}$

1: for $t = 1, \dots, T$ do

2:   Draw a random sample $(x_t^1, \dots, x_t^S, y_t)$ from the training data.

3:   Find solution $A^* = [\alpha^{1*} \dots \alpha^{S*}] \in \mathbb{R}^{d \times S}$ of the joint sparse
     coding problem

     $$\operatorname*{argmin}_{A = [\alpha^1 \dots \alpha^S]} \frac{1}{2} \sum_{s=1}^{S} \|x_t^s - D^s \alpha^s\|_{\ell_2}^2 + \lambda_1 \|A\|_{\ell_{12}} + \frac{\lambda_2}{2} \|A\|_F^2.$$

4:   Compute the set of active rows $\Lambda$ of $A^*$ using (18).

5:   Compute $\hat{D} \in \mathbb{R}^{n \times |\Upsilon|}$ using (19).

6:   Compute $\Delta \in \mathbb{R}^{|\Upsilon| \times |\Upsilon|}$ using (20).

7:   Compute $\beta \in \mathbb{R}^{dS}$ as:

     $$\beta_{\Upsilon^c} = 0, \quad \beta_{\Upsilon} = (\hat{D}^T \hat{D} + \lambda_1 \Delta + \lambda_2 I)^{-1} g,$$

     where $\Upsilon = \cup_{j \in \Lambda} \{j, j + d, \dots, j + (S-1)d\}$ and
     $g = \mathrm{vec}\left(\nabla_{A^{*T}_{\Lambda\rightarrow}} \sum_{s=1}^{S} l_{su}(y_t, w^s, \alpha^{s*})\right)$.

8:   Choose the learning rate $\rho_t \leftarrow \min(\rho, \rho\, t_0 / t)$.

9:   Update the parameters by a projected gradient step:

     $$w^s \leftarrow \Pi_{\mathcal{W}^s}\left[w^s - \rho_t \left(\nabla_{w^s} l_{su}(y_t, w^s, \alpha^{s*}) + \nu w^s\right)\right],$$

     $$D^s \leftarrow \Pi_{\mathcal{D}^s}\left[D^s - \rho_t \left((x_t^s - D^s \alpha^{s*}) \beta_{\tilde{s}}^T - D^s \beta_{\tilde{s}} \alpha^{s*T}\right)\right],$$

     $\forall s \in \mathcal{S}$, where $\tilde{s} = \{s, s + S, \dots, s + (d-1)S\}$.

10: end for
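The two index sets used in steps 7 and 9 of Algorithm 1 are easy to mix up, so a small sketch may help: one set collects, for every active row $j$, the entries $j, j+d, \dots, j+(S-1)d$, and the other collects, for modality $s$, the entries $s, s+S, \dots, s+(d-1)S$. The function names `upsilon` and `s_tilde` and the toy sizes are illustrative assumptions. Indices are 1-based, as in the text.

```python
def upsilon(active_rows, d, S):
    """Union over active rows j of {j, j+d, ..., j+(S-1)d}, 1-based, sorted."""
    return sorted(j + m * d for j in active_rows for m in range(S))


def s_tilde(s, d, S):
    """Index set {s, s+S, ..., s+(d-1)S} for modality s in 1..S, 1-based."""
    return [s + k * S for k in range(d)]


d, S = 4, 3                      # 4 atoms per dictionary, 3 modalities
print(upsilon([2, 4], d, S))     # prints [2, 4, 6, 8, 10, 12]
print(s_tilde(1, d, S))          # prints [1, 4, 7, 10]
```

Both sets index into a vector of length $dS$; the first has $S|\Lambda|$ elements and the second has $d$ elements.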


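following optimization problem:

$$\operatorname*{argmin}_{A = [\alpha^1 \dots \alpha^S]} \frac{1}{2} \sum_{s=1}^{S} \|x^s - D^s \alpha^s\|_{\ell_2}^2 + \lambda_1 \|A\|_{\ell_{12}} + \lambda_1' \|A\|_{\ell_{11}} + \frac{\lambda_2}{2} \|A\|_F^2, \qquad (23)$$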


where $\lambda_1'$ is the regularization parameter for the added
$\ell_{11}$ norm and the other terms are the same as those in Eq. (7).
The selection of $\lambda_1$ and $\lambda_1'$ influences the sparsity
pattern of $A^*$. Intuitively, as $\lambda_1 / \lambda_1'$ increases,
the group constraint becomes dominant and more collaboration is
enforced among the modalities. On the other hand, small values of
$\lambda_1 / \lambda_1'$ encourage independent reconstructions across
modalities. In the extreme case of $\lambda_1$ being set to zero, the
above optimization problem is separable across the modalities. The
above formulation brings added flexibility at the cost of one
additional design parameter, which is obtained in this paper using
cross-validation.
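The effect of the two penalties can be seen on a toy example. The sketch below assumes the usual definitions, with the $\ell_{12}$ norm taken as the sum of the $\ell_2$ norms of the rows of $A$ and the $\ell_{11}$ norm as the sum of the absolute values of all entries. The function names and the two toy code matrices are illustrative only.

```python
import math


def l12_norm(A):
    """Sum of the l2 norms of the rows of A (rewards shared active rows)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in A)


def l11_norm(A):
    """Sum of the absolute values of all entries of A."""
    return sum(abs(v) for row in A for v in row)


def mixed_penalty(A, lam1, lam1_prime):
    """lam1 * ||A||_l12 + lam1_prime * ||A||_l11, the two sparsity terms."""
    return lam1 * l12_norm(A) + lam1_prime * l11_norm(A)


# Two modalities (columns), two atoms (rows), same entry magnitudes.
shared = [[1.0, 1.0], [0.0, 0.0]]   # both modalities use atom 1
split = [[1.0, 0.0], [0.0, 1.0]]    # each modality uses a different atom

group_only = (mixed_penalty(shared, 1.0, 0.0), mixed_penalty(split, 1.0, 0.0))
indep_only = (mixed_penalty(shared, 0.0, 1.0), mixed_penalty(split, 0.0, 1.0))
```

With only the group term, the shared pattern costs $\sqrt{2}$ against 2 for the split pattern, so collaboration is rewarded. With only the $\ell_{11}$ term, both patterns cost 2 and the modalities are treated independently.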

Here we present how Algorithm 1 should be modified to solve the
supervised multimodal dictionary learning problem under the mixed
$\ell_{12} - \ell_{11}$ constraint. The proof for obtaining the
algorithm is similar to the one for the $\ell_{12}$ norm and is briefly
discussed in the Appendix. In Algorithm 1, let $A^*$ be the solution of
the optimization problem (23) and let $\Lambda$ be the set of its
active rows. Let $\Omega \subseteq \{1, \dots, S|\Lambda|\}$ be the set
of indices with non-zero entries in
$\mathrm{vec}(A^{*T}_{\Lambda\rightarrow})$; i.e., it consists of the
non-zero entries of the active rows of $A^*$. Let $\hat{D}$, $\Delta$,
and $g$ be the same as those defined in Algorithm 1. Then,
$\beta \in \mathbb{R}^{dS}$ is updated as

$$\beta_{\Upsilon^c} = 0, \quad \beta_{\Upsilon} = (\hat{D}_{\Omega}^T \hat{D}_{\Omega} + \lambda_1 \Delta_{\Omega} + \lambda_2 I)^{-1} g_{\Omega},$$

where $\Upsilon$ is the set of indices with non-zero entries in
$\mathrm{vec}(A^{*T})$ and $\Upsilon^c = \{1, \dots, dS\} \setminus \Upsilon$.
Note that $\Upsilon$ is defined over the entire matrix $A^*$ while
$\Omega$ is defined over its active rows. The rest of the algorithm
remains unchanged.

Fig. 2: Extracted modalities from a sample in AR dataset.

IV.  Results  and  discussion 

The performance of the proposed multimodal dictionary learning
algorithms is evaluated on the AR face database [55], the CMU Multi-PIE
dataset [56], the IXMAS action recognition dataset [57], and the WVU
multimodal dataset [58]. For these algorithms, $l_{su}$ is chosen to be
the quadratic loss of Eq. (16) to handle the multiclass classification.
In our experiments, it is observed that the multiclass formulation
achieves classification performance similar to that of the logistic
loss formulation of Eq. (13) in the one-vs-all setting. Regularization
parameters $\lambda_1$ and $\nu$ are selected using cross-validation in
the sets $\{0.01 + 0.005k \mid k \in \{-3, 3\}\}$ and
$\{10^{-2}, \dots, 10^{-9}\}$, respectively. It is observed that when
the number of dictionary atoms is kept small compared to the number of
training samples, $\nu$ can be arbitrarily set to a small value, e.g.
$\nu = 10^{-8}$, for the normalized inputs. When the mixed
$\ell_{12} - \ell_{11}$ norm is used, the regularization parameters
$\lambda_1$ and $\lambda_1'$ are selected by cross-validation in the
set $\{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05\}$. The parameter
$\lambda_2$ is set to zero in most of the experiments, except when
using the $\ell_{11}$ prior in Section IV-B1, where a small positive
value for $\lambda_2$ was required for convergence. The learning rate
$\rho_t$ is selected according to the heuristic proposed in [29], i.e.
$\rho_t = \min(\rho, \rho\, t_0 / t)$, where $\rho$ and $t_0$ are
constants. This results in a constant learning rate during the first
$t_0$ iterations and a $1/t$ annealing strategy for the rest of the
iterations. It is observed that choosing $t_0 = T/10$, where $T$ is the
total number of iterations over the whole training set, works well for
all of our experiments. Different values of $\rho$ are tried during the
first few iterations and the one that results in the minimum error on a
small validation set is retained. $T$ is set to 20 in all the
experiments. We observed empirically that the selection of these
parameters is quite robust and that small variations in their values do
not considerably affect the obtained results. We also used a mini-batch
size of 100 in all our experiments. It should also be noted that the
design parameters for the competing algorithms are also selected using
cross-validation for a fair comparison.
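The learning-rate heuristic described above can be sketched in a few lines. The schedule $\rho_t = \min(\rho, \rho\, t_0 / t)$ with $t_0 = T/10$ and $T = 20$ follows the text; the base rate `rho = 0.1` is a hypothetical value chosen only for illustration, since the paper selects $\rho$ on a validation set.

```python
def learning_rate(t, rho, t0):
    """rho_t = min(rho, rho * t0 / t) for iteration index t >= 1."""
    return min(rho, rho * t0 / t)


T = 20          # total number of iterations, as stated in the text
t0 = T / 10     # heuristic choice from the text
rho = 0.1       # hypothetical base learning rate (not from the paper)

rates = [learning_rate(t, rho, t0) for t in range(1, T + 1)]
# Constant at rho for t <= t0, then decaying proportionally to 1/t.
```

With these values the rate stays at 0.1 for the first two iterations and then falls to 0.05 at $t = 4$ and 0.01 at $t = 20$.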

A.  AR  face  recognition 

The  AR  dataset  consists  of  faces  under  different  poses, 
illumination  and  expression  conditions,  captured  in  two 
sessions.  A  set  of  100  users  are  used,  each  consisting  of 


TABLE III: Multimodal classification results obtained for the AR dataset.

SVM-Maj  SVM-Sum  LR-Maj  LR-Sum  MKL    JSRC [14]  JDSRC [7]  SMDL$_{\ell_{11}}$  SMDL$_{\ell_{12}}$  SMDL$_{\ell_{12}-\ell_{11}}$
85.57    92.14    85.00   91.14   91.14  96.14      96.14      95.86               96.86               97.14


TABLE I: Correct classification rates obtained using the
whole face modality for the AR database.

SVM    MKL [59]  LR     SRC [10]  UDL    SDL [29]
86.43  82.86     81.00  88.86     89.58  90.57


TABLE II: Comparison of the $\ell_{11}$ and $\ell_{12}$ priors for
multimodal classification. Modalities include 1. left periocular,
2. right periocular, 3. nose, 4. mouth, and 5. face.

Modalities           {1,2}   {1,2,3}   {1,2,3,4}   {1,2,3,4,5}
UMDL$_{\ell_{11}}$   81.9    87.57     90.14       95.57
UMDL$_{\ell_{12}}$   82.6    87.86     92.00       96.29
SMDL$_{\ell_{11}}$   83.86   89.86     92.42       95.86
SMDL$_{\ell_{12}}$   86.43   89.86     93.57       96.86

seven  images  from  the  first  session  as  training  samples  and 
seven  images  from  the  second  session  as  test  samples.  A 
small  randomly  selected  portion  of  the  training  set,  50  out 
of  700,  is  used  as  validation  set  for  optimizing  the  design 
parameters.  Fusion  is  taken  on  five  modalities  which  are 
the  left  and  right  periocular,  nose,  mouth,  and  the  whole 
face  modalities,  similar  to  the  setup  in  [14],  [45].  A  test 
sample  from  the  AR  dataset  and  the  extracted  modalities 
are shown in Fig. 2. Raw pixels are first PCA-transformed
and then normalized to have zero mean and unit $\ell_2$ norm.
The  dictionary  size  for  the  dictionary  learning  algorithms 
is  chosen  to  be  four  per  class,  resulting  in  dictionaries  of 
overall  400  atoms. 

Classification  using  the  whole  face  modality :  The  classi¬ 
fication  results  using  the  whole  face  modality  are  shown  in 
Table  I.  The  results  are  obtained  using  linear  support  vector 
machine  (SVM)  [60],  multiple  kernel  learning  (MKL)  [59], 
logistic  regression  (LR)  [60],  sparse  representation  classi¬ 
fication  (SRC)  [10],  and  unsupervised  and  supervised  dic¬ 
tionary  learning  algorithms  (UDL  and  SDL)  [29].  For  the 
MKL  algorithm,  linear,  polynomial,  and  RBF  kernels  are 
used.  The  UDL  and  SDL  are  equipped  with  the  quadratic 
classifier  (16).  The  SDL  results  in  the  best  performance. 

$\ell_{11}$ vs $\ell_{12}$ sparse priors for multimodal classification: A
straightforward way of utilizing the single-modal dictionary learning
algorithms, namely UDL and SDL, for multimodal classification is to
train independent dictionaries and classifiers for each modality and
then combine the individual scores for a fused decision. This way of
fusion is equivalent to using the $\ell_{11}$ norm on $A$, instead of
the $\ell_{12}$ norm, in Eq. (7) (or setting $\lambda_1$ to zero in
Eq. (23)), which does not enforce row sparsity in the sparse
coefficients. We denote the corresponding unsupervised and supervised
multimodal dictionary learning algorithms using only the $\ell_{11}$
norm as UMDL$_{\ell_{11}}$ and SMDL$_{\ell_{11}}$, respectively.
Similarly, the proposed unsupervised and supervised multimodal
dictionary learning algorithms using the $\ell_{12}$ norm are denoted as


TABLE IV: Comparison of the reconstructive-based (JSRC
and JSRC-UDL) and the proposed discriminative-based
(SMDL$_{\ell_{12}}$) classification algorithms obtained using the
joint sparsity prior for different numbers of dictionary
atoms per class on the AR dataset.

atoms/class   JSRC    JSRC-UDL   SMDL$_{\ell_{12}}$
1             46.14   71.71      91.28
2             69.00   78.86      95.00
3             79.57   83.57      95.71
4             88.14   91.14      96.86
5             91.00   94.85      97.14
6             94.43   96.28      96.71
7             96.14   96.14      96.00

UMDL$_{\ell_{12}}$ and SMDL$_{\ell_{12}}$. Table II compares the
performance of the multimodal dictionary learning algorithms under the
two priors. As shown, the proposed algorithms with the $\ell_{12}$
prior, which enforces collaboration among the modalities, have better
fusion performance than those with the $\ell_{11}$ prior. In
particular, SMDL$_{\ell_{12}}$ has significantly better performance
than SMDL$_{\ell_{11}}$ for fusion of the first and second (left and
right periocular) modalities. This agrees with the intuition that these
modalities are highly correlated and that learning the multimodal
dictionaries jointly indeed improves the recognition performance.

Comparison with other fusion methods: The performance of the proposed
fusion algorithms under different sparsity priors is compared with that
of several state-of-the-art decision-level and feature-level fusion
algorithms. In addition to the $\ell_{11}$ and $\ell_{12}$ priors, we
evaluate the proposed supervised multimodal dictionary learning
algorithm with the mixed $\ell_{12} - \ell_{11}$ norm, which is denoted
as SMDL$_{\ell_{12}-\ell_{11}}$. One way to achieve decision-level
fusion is to train independent classifiers for each modality and
aggregate the outputs, either by adding the corresponding scores of
each modality to come up with the fused decision, or by majority voting
among the independent decisions obtained from the different modalities.
These approaches are abbreviated as Sum and Maj, respectively, and are
used with SVM and LR classifiers for decision-level fusion. The
proposed methods are also compared with feature-level fusion methods
including the joint sparse representation classifier (JSRC) [14], the
joint dynamic sparse representation classifier (JDSRC) [7], and MKL.
For the JSRC and JDSRC, the dictionary consists of all the training
samples. Table III compares the performance of our proposed algorithms
with the other fusion algorithms for the AR dataset. As expected,
multimodal fusion results in a significant performance improvement
compared to using only the whole face modality. Moreover, the proposed
SMDL$_{\ell_{12}}$ and SMDL$_{\ell_{12}-\ell_{11}}$ achieve the best
performance.
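The two decision-level rules described above (Sum and Maj) can be sketched as follows. The function names, the tie-breaking rule for majority voting, and the toy score matrix are illustrative assumptions; the paper does not specify how ties are resolved.

```python
from collections import Counter


def fuse_sum(scores):
    """Sum rule: scores[m][c] is the score of class c from modality m.

    Returns the class index with the largest total score.
    """
    n_classes = len(scores[0])
    totals = [sum(s[c] for s in scores) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


def fuse_majority(scores):
    """Majority rule: every modality votes for its own top-scoring class.

    Ties are broken in favour of the lowest class index (an assumption).
    """
    votes = Counter(max(range(len(s)), key=s.__getitem__) for s in scores)
    best = max(votes.values())
    return min(c for c, v in votes.items() if v == best)


# Three modalities, three classes: two weak votes for class 0,
# one very confident vote for class 1.
scores = [[0.60, 0.40, 0.0],
          [0.55, 0.45, 0.0],
          [0.00, 1.00, 0.0]]

print(fuse_sum(scores), fuse_majority(scores))  # prints 1 0
```

The example shows how the two rules can disagree: the sum rule lets one confident modality override two marginal ones, while majority voting ignores the confidence of each vote.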

Fig. 3: Computational time required to solve the optimization
problem (7) for a given test sample.

TABLE V: Comparison of the supervised multimodal dictionary
learning algorithms with different sparsity priors for
face recognition under occlusion on the AR dataset.

SMDL$_{\ell_{12}}$   SMDL$_{\ell_{11}}$   SMDL$_{\ell_{12}-\ell_{11}}$
89.00                90.54                91.15

Reconstructive vs discriminative formulation with joint
sparsity prior. Comparison of the algorithms with joint sparsity priors
in Table III indicates that the proposed SMDL$_{\ell_{12}}$ algorithm
equipped with dictionaries of size 400 achieves relatively better
results than the JSRC, which uses dictionaries of size 700. The results
confirm the idea that by using the supervised formulation, compared to
using the reconstruction error, one can achieve better classification
performance even with more compact dictionaries. For further
comparison, an experiment is performed in which the correct
classification rates of the reconstructive and discriminative
formulations are compared when their dictionary sizes are kept equal.
For a given number of dictionary atoms per class $d$, the dictionaries
of JSRC are thus constructed by random selection of $d$ training
samples from the different classes. This is different from the standard
JSRC, utilized for the results in Table III, in which all the training
samples are used to construct the dictionaries [14]. Moreover, to
utilize all the available training samples for the reconstructive
approach and make a more meaningful comparison, we use the unsupervised
multimodal dictionary learning algorithm of Eq. (8) to train
class-specific sub-dictionaries that minimize the reconstruction error
in approximating the training samples of a given class. These
sub-dictionaries are then stacked to construct the final dictionaries,
similar to the approach in [22]. We call this algorithm JSRC-UDL to
indicate that the dictionaries are indeed learned by the reconstructive
formulation. Table IV summarizes the recognition performance of JSRC
and JSRC-UDL in comparison to the proposed SMDL$_{\ell_{12}}$, which
enjoys a discriminative formulation, for different numbers of
dictionary atoms per class. As seen, SMDL$_{\ell_{12}}$ outperforms the
reconstructive approaches, especially when the number of dictionary
atoms is chosen to be relatively small. This is the main advantage of
SMDL$_{\ell_{12}}$ compared to the reconstructive approaches, in that
more compact dictionaries can be

Fig. 4: Configurations of the cameras and sample multi-view
images from the CMU Multi-PIE dataset.


used for the recognition task, which is important for real-time
applications. It is clear that the reconstructive model can only result
in comparable performance when the dictionary size is chosen to be
relatively large. On the other hand, the SMDL$_{\ell_{12}}$ algorithm
may over-fit with a large number of dictionary atoms. In terms of
computational expense at test time, as discussed in [14], the time
required to solve the optimization problem (7) is expected to be linear
in the dictionary size using the efficient ADMM if the required matrix
factorization is cached beforehand. The typical computational time to
solve (7) for a given multimodal test sample is shown in Fig. 3 for
different dictionary sizes. As expected, it increases linearly as the
size of the dictionary increases. This illustrates the advantage of the
SMDL$_{\ell_{12}}$ algorithm, which results in state-of-the-art
performance with more compact dictionaries.

Classification in presence of disguise: The AR dataset also contains
600 occluded samples per session, 1200 images overall, where the faces
are disguised using sunglasses or a scarf. Here we use these additional
images to evaluate the robustness of the proposed algorithms. Similar
to the previous experiments, images from session 1 are used as training
samples and images from session 2 are used as test data. The
classification performance under different sparsity priors is shown in
Table V and, as expected, SMDL$_{\ell_{12}-\ell_{11}}$ achieves the
best performance. In the presence of occlusion, some of the modalities
are less coupled and the joint sparsity prior among all the modalities
may be too stringent, as is also reflected in the results.


B.  Multi-view  recognition 

1) Multi-view face recognition: In this section, the performance of the
proposed algorithm is evaluated for multi-view face recognition using
the CMU Multi-PIE dataset [56]. The dataset consists of a large number
of face images under different illuminations, viewpoints, and
expressions which are recorded in four sessions over the span of
several months. Subjects were imaged using 13 cameras at different
view-angles of $\{0°, ±15°, ±30°, ±45°, ±60°, ±75°, ±90°\}$ at head
height. Illustrations of the multiple camera configurations, as well as
sample multi-view images, are shown in Fig. 4. We use the multi-view
face images for 129 subjects that are

TABLE VI: Correct classification rates obtained using
individual modalities in the CMU Multi-PIE database.

View      SVM     MKL     LR      SRC     UDL     SDL
Left      47.30   52.85   43.65   49.85   47.80   50.45
Frontal   41.15   54.10   45.40   54.25   52.10   56.10
Right     47.30   51.85   42.85   52.55   43.10   48.50


TABLE VII: Correct classification rates (CCR) obtained
using multi-view images on the CMU Multi-PIE database.

Algorithm            CCR     Algorithm                      CCR
SVM-Maj              62.95   LR-Maj                         69.40
SVM-Sum              69.30   LR-Sum                         71.10
MKL                  72.40   JSRC                           73.30
JDSRC                70.20   UMDL$_{\ell_{11}}$             74.80
SMDL$_{\ell_{11}}$   77.25   UMDL$_{\ell_{12}}$             70.50
SMDL$_{\ell_{12}}$   76.10   SMDL$_{\ell_{12}-\ell_{11}}$   81.30

present in all sessions. The face regions for all the poses are
extracted manually and resized to $10 \times 8$. Similar to the
protocol used in [61], images from session 1 at views
$\{0°, ±30°, ±60°, ±90°\}$ are used as training samples. Test images
are obtained from all available view angles from session 2 to have a
more realistic scenario in which not all the testing poses are
available in the training set. To handle multi-view recognition using
the multimodal formulation, we divide the available views into three
sets of $\{-90°, -75°, -60°, -45°\}$, $\{-30°, -15°, 0°, 15°, 30°\}$,
and $\{45°, 60°, 75°, 90°\}$, each of which forms a modality. A test
sample is then constructed by randomly selecting an image from each
modality. Two thousand test samples are generated in this way. The
dictionary size for the dictionary learning algorithms is chosen to
have two atoms per class.
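The construction of a multi-view test sample can be sketched as below. The three view groups are those listed in the text; the group names, the image identifiers, and the choice to first draw a view angle and then an image at that angle are assumptions made for illustration, since the text only states that one image is drawn at random from each modality.

```python
import random

# View angles (degrees) forming the three modalities described in the text.
VIEW_GROUPS = {
    "left": [-90, -75, -60, -45],
    "frontal": [-30, -15, 0, 15, 30],
    "right": [45, 60, 75, 90],
}


def draw_test_sample(images_by_view, rng):
    """Draw one (angle, image) pair per modality for a single subject.

    images_by_view maps a view angle to a list of image identifiers.
    Assumption: an angle is drawn uniformly, then an image at that angle.
    """
    sample = {}
    for name, angles in VIEW_GROUPS.items():
        angle = rng.choice(angles)
        sample[name] = (angle, rng.choice(images_by_view[angle]))
    return sample


# Hypothetical image identifiers for one subject, three images per view.
images = {angle: [f"img_{angle}_{i}" for i in range(3)]
          for angles in VIEW_GROUPS.values() for angle in angles}

rng = random.Random(0)
samples = [draw_test_sample(images, rng) for _ in range(5)]
```

Each resulting sample holds exactly three images, one per modality, so test poses that never appear in training (for example ±15°, ±45° and ±75°) can still be drawn.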

The classification results obtained using individual modalities are
shown in Table VI. As expected, better classification performance is
obtained using the frontal view. Results of the multi-view face
recognition are shown in Table VII. The proposed supervised dictionary
learning algorithms outperform the corresponding unsupervised methods
and the other fusion algorithms. SMDL$_{\ell_{12}-\ell_{11}}$ results
in state-of-the-art performance. It is consistently observed in all the
studied applications that the multimodal dictionary learning algorithm
with the mixed prior results in better performance than those with the
individual $\ell_{11}$ or $\ell_{12}$ prior. However, it requires one
additional regularization parameter to be tuned. For the rest of the
paper, the performance of the proposed dictionary learning algorithms
is only reported under the individual priors.

2) Multi-view action recognition: This section presents the results of
the proposed algorithm for multi-view action recognition using the
IXMAS dataset [57]. Each action is recorded simultaneously by cameras
from five different viewpoints, which are considered as modalities in
this experiment. A multimodal sample of the IXMAS dataset is shown in
Fig. 5. The dataset contains 11 action classes where each action is
repeated three times by each of the ten actors, resulting in 330
sequences per view. The dataset includes actions


Fig.  5:  Sample  frames  of  the  IXMAS  dataset  from  5 
different  views. 


TABLE VIII: Correct classification rates (CCR) obtained
for multi-view action recognition on the IXMAS database.

Algorithm               CCR    Algorithm               CCR
Junejo et al. [65]      79.6   Tran and Sorokin [62]   80.2
Wu et al. [66]          88.2   Wang et al. 1 [63]      87.8
Wang et al. 2 [63]      93.6   JSRC                    93.6
UMDL$_{\ell_{11}}$      90.3   SMDL$_{\ell_{11}}$      93.9
UMDL$_{\ell_{12}}$      90.6   SMDL$_{\ell_{12}}$      94.8

such as check watch, cross arms, and scratch head. Similar to the work
in [57], [62], [63], leave-one-actor-out cross-validation is performed
and samples from all five views are used for training and testing.

We use dense trajectories as features, which are generated using the
publicly available code [63], in which a 2000-word codebook is
generated from a random subset of these trajectories by k-means
clustering as in [64]. Note that Wang et al. [63] used HOG, HOF, and
MBH descriptors in addition to the dense trajectories. However, here
only dense trajectory descriptors are used. The number of dictionary
atoms for the proposed dictionary learning algorithms is chosen to be 4
atoms per class, resulting in a dictionary of 44 atoms per view. The
five dictionaries for JSRC are constructed using all the training
samples, thus each dictionary, corresponding to a different view, has
297 atoms.
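The dictionary sizes quoted above follow from simple counting, which the sketch below reproduces. It assumes that under leave-one-actor-out cross-validation all sequences of the held-out actor are removed from the training set; the variable names are illustrative.

```python
n_classes = 11        # action classes in IXMAS
n_repeats = 3         # repetitions of each action per actor
n_actors = 10         # actors in the dataset
atoms_per_class = 4   # atoms per class for the learned dictionaries

# Sequences available per view in the whole dataset.
sequences_per_view = n_classes * n_repeats * n_actors

# Size of each learned dictionary (one per view).
learned_dict_size = atoms_per_class * n_classes

# JSRC uses every training sequence as an atom; with one actor held out,
# nine actors remain in the training set.
jsrc_dict_size = n_classes * n_repeats * (n_actors - 1)

print(sequences_per_view, learned_dict_size, jsrc_dict_size)  # prints 330 44 297
```

The learned dictionaries are therefore roughly seven times smaller per view than the JSRC dictionaries in this experiment.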

Table VIII shows the average accuracies over all classes for the
compared algorithms, including state-of-the-art methods. The Wang et
al. 1 [63] algorithm uses only the dense trajectories as features,
similar to our setup. The Wang et al. 2 [63] algorithm, however, uses
HOG, HOF, and MBH descriptors and the spatio-temporal pyramids in
addition to the trajectory descriptor. The results show that the
proposed SMDL$_{\ell_{12}}$ algorithm achieves the best performance
while the SMDL$_{\ell_{11}}$ algorithm achieves the second best
performance. This indicates that the sparse coefficients generated by
the trained dictionaries are indeed more discriminative than the
engineered features. The resulting confusion matrix of the
SMDL$_{\ell_{12}}$ algorithm is shown in Fig. 6.

C.  Multimodal  biometric  recognition 

The  WVU  dataset  consists  of  different  biometric  modali¬ 
ties  such  as  fingerprint,  iris,  palmprint,  hand  geometry,  and 
voice  from  subjects  of  different  age,  gender,  and  ethnicity. 
It  is  a  challenging  data  set,  as  many  of  the  samples 
are  corrupted  with  blur,  occlusion,  and  sensor  noise.  In 
this  paper,  two  irises  (left  and  right)  and  four  fingerprint 
modalities  are  used.  The  evaluation  is  done  on  a  subset 
of 202 subjects which have more than four samples in all

TABLE IX: Correct classification rates obtained using individual modalities in the WVU database.

       Finger 1       Finger 2       Finger 3       Finger 4       Iris 1         Iris 2
SVM    56.77 ± 0.72   82.95 ± 2.15   55.83 ± 2.03   80.47 ± 0.91   60.67 ± 1.78   57.52 ± 1.95
MKL    61.81 ± 1.39   82.55 ± 1.47   63.50 ± 1.75   81.85 ± 0.74   56.31 ± 2.20   54.49 ± 0.79
LR     55.64 ± 1.89   81.10 ± 1.85   55.21 ± 2.21   78.82 ± 0.66   55.25 ± 1.48   56.86 ± 1.70
SRC    67.66 ± 1.86   88.68 ± 1.59   69.29 ± 0.77   88.68 ± 1.03   65.43 ± 1.24   67.78 ± 1.76
UDL    64.68 ± 2.11   87.35 ± 2.23   67.35 ± 1.22   86.40 ± 0.70   64.36 ± 1.37   65.23 ± 2.02
SDL    66.29 ± 1.81   88.84 ± 2.31   68.61 ± 1.30   87.50 ± 0.82   66.05 ± 0.75   67.31 ± 1.38

Fig. 6: The confusion matrix obtained by the SMDL$_{\ell_{12}}$
algorithm on the IXMAS dataset. The actions are 1: check
watch, 2: cross arms, 3: scratch head, 4: sit down, 5: get
up, 6: turn around, 7: walk, 8: wave, 9: punch, 10: kick and
11: pick up.


modalities.  Samples  from  different  modalities  are  shown  in 
Fig.  1.  The  training  set  is  formed  by  randomly  selecting 
four  samples  from  each  subject,  overall  808  samples.  The 
remaining  509  samples  are  used  for  testing.  The  features 
used  here  are  those  described  in  [14]  which  are  further 
PCA-transformed. The dimensions of the input data after
preprocessing are 178 and 550 for the fingerprint and iris
modalities, respectively. All inputs are normalized to have
zero mean and unit $\ell_2$ norm. The number of dictionary
atoms for the dictionary learning algorithms is chosen to
be 2 per class, resulting in dictionaries of 404 atoms overall.
The  dictionaries  for  JSRC  and  JDSRC  are  constructed  using 
all  the  training  samples. 
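
The preprocessing and dictionary sizing described above can be sketched as follows. This is only an illustration under stated assumptions: samples are stored as columns, the normalization is applied per sample, and the helper name `normalize_samples` and the random stand-in data are hypothetical rather than taken from the original experiments.

```python
import numpy as np

def normalize_samples(X, eps=1e-12):
    """Center each column (one sample) and scale it to unit l2 norm.

    Assumption: normalization is applied per sample, with samples as columns.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)            # zero mean per sample
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return X / np.maximum(norms, eps)                # unit l2 norm per sample

# Stand-in data: 808 training samples of a 178-dimensional fingerprint feature.
rng = np.random.default_rng(0)
X_train = normalize_samples(rng.standard_normal((178, 808)))

# Two atoms per class for 202 subjects gives the dictionary size used in the text.
n_classes, atoms_per_class = 202, 2
n_atoms = n_classes * atoms_per_class
```

With these settings `n_atoms` equals 404, matching the dictionary size quoted in the text.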

The  classification  results  obtained  using  individual 
modalities  on  5  different  splits  of  the  data  into  training  and 
test  samples  are  shown  in  Table  IX.  As  shown,  finger  2  is 
the  strongest  modality  for  the  recognition  task.  The  SRC 
and SDL algorithms achieve the best results. Note that the
dictionary size of SRC is twice that of SDL (808 versus 404 atoms).

TABLE X: Multimodal classification results obtained for
the WVU dataset.

Algorithm               4 Fingerprints    2 Irises        All modalities
SVM-Maj                 90.14 ± 0.70      65.30 ± 1.92    95.24 ± 0.92
SVM-Sum                 93.56 ± 1.26      74.03 ± 1.89    97.09 ± 0.83
LR-Maj                  89.23 ± 1.63      63.73 ± 1.29    94.18 ± 1.13
LR-Sum                  93.60 ± 0.96      71.43 ± 1.91    98.51 ± 0.18
MKL                     93.28 ± 1.52      67.23 ± 0.70    94.46 ± 0.87
JSRC                    97.64 ± 0.44      82.94 ± 0.78    98.89 ± 0.30
JDSRC                   97.17 ± 0.26      79.61 ± 0.70    97.80 ± 0.51
UMDL$_{\ell_{11}}$      97.09 ± 0.56      80.90 ± 0.61    98.62 ± 0.46
SMDL$_{\ell_{11}}$      97.41 ± 0.71      82.83 ± 0.87    98.66 ± 0.43
UMDL$_{\ell_{12}}$      96.78 ± 0.57      81.53 ± 2.18    98.78 ± 0.43
SMDL$_{\ell_{12}}$      97.56 ± 0.41      83.77 ± 0.89    99.10 ± 0.30

For multimodal classification, we consider fusion of the
fingerprints, fusion of the irises, and fusion of all the modalities.
Table X summarizes the correct classification rates of
several fusion algorithms using 4 fingerprints, 2 irises, and
all the modalities, obtained on 5 different training and test
splits. Fig. 7 shows the corresponding cumulative matched
score curves (CMC) for the competitive methods. CMC is
a performance measure, similar to ROC, which was originally
proposed for biometric recognition systems [67]. As
seen, the SMDL$_{\ell_{12}}$ algorithm outperforms the competitive
algorithms and achieves state-of-the-art performance
using the irises and all modalities, with rank-one
recognition rates of 83.77% and 99.10%, respectively. Using
the fingerprints, the performance of SMDL$_{\ell_{12}}$ is close to
that of the best-performing algorithm, JSRC (97.56% versus
97.64%). The results suggest that using the joint sparsity prior
indeed improves the multimodal classification performance by
extracting the coupled information among the modalities.
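
To make the CMC measure concrete, the following minimal sketch computes rank-k recognition rates from a score matrix. It assumes that every test sample receives one score per enrolled subject and that larger scores mean better matches (for a sparse-representation classifier one could, for instance, use the negative class-wise reconstruction residual). The function name `cmc_curve` and the toy scores are hypothetical and are not the evaluation code behind Fig. 7.

```python
import numpy as np

def cmc_curve(scores, true_labels, max_rank=None):
    """Cumulative match curve: fraction of test samples whose true class
    is ranked within the top k, for k = 1..max_rank.

    scores      : (n_test, n_classes) array, larger means a better match
    true_labels : (n_test,) integer class indices
    """
    scores = np.asarray(scores, dtype=float)
    true_labels = np.asarray(true_labels)
    n_test, n_classes = scores.shape
    max_rank = n_classes if max_rank is None else max_rank
    true_scores = scores[np.arange(n_test), true_labels]
    # Rank of the true class: one plus the number of classes scoring strictly
    # higher (ties are therefore resolved in favor of the true class).
    ranks = 1 + np.sum(scores > true_scores[:, None], axis=1)
    return np.array([np.mean(ranks <= k) for k in range(1, max_rank + 1)])

# Toy example with three test samples and three enrolled subjects.
scores = np.array([[0.90, 0.10, 0.00],
                   [0.20, 0.50, 0.30],
                   [0.40, 0.35, 0.25]])
labels = np.array([0, 2, 1])
cmc = cmc_curve(scores, labels)
```

The first entry of `cmc` is the rank-one recognition rate, which is the quantity reported in Table X; the curve is non-decreasing and reaches one at the last rank.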

Comparison of the algorithms with the joint sparsity
priors indicates that the proposed SMDL$_{\ell_{12}}$ algorithm
equipped with dictionaries of size 404 achieves comparable,
and mostly better, results than JSRC, which uses a dictionary
of size 808. Similar to the experiment in Section IV-A,
we compared the reconstructive and discriminative algorithms
that are based on the joint sparsity prior when the
number of dictionary atoms per class is kept equal. Fig. 8
summarizes the results of the different fusion scenarios.
As seen, SMDL$_{\ell_{12}}$ significantly outperforms JSRC and
JSRC-UDL when the number of dictionary atoms per class
is chosen to be 1 or 2. The results are consistent with
those of Table IV for the AR dataset, indicating that the
proposed supervised formulation equipped with more compact
dictionaries achieves better performance than the
reconstructive formulation for the studied biometric
recognition applications.

V.  Conclusions  and  Future  Works 

The problem of multimodal classification using sparsity
models was studied and a task-driven formulation was
proposed to jointly find the optimal dictionaries and classifiers
under the joint sparsity prior. It was shown that the
resulting bi-level optimization problem is smooth, and a
stochastic gradient descent algorithm was proposed to solve
the corresponding optimization problem. The algorithm
was then extended to a more general scenario where
the sparsity prior is a combination of the joint and
independent sparsity constraints. The simulation results on
the studied image classification applications suggest that
while the unsupervised dictionaries can be used for feature
learning, the sparse coefficients generated by the proposed
multimodal task-driven dictionary learning algorithms are
usually more discriminative and therefore can result in
improved multimodal classification performance. It was
also shown that, compared to the sparse-representation
classification algorithms (JSRC, JDSRC, and JSRC-UDL),
the proposed algorithms can achieve significantly better
performance when compact dictionaries are utilized.

Fig. 7: CMC plots obtained by fusing the irises (top),
fingerprints (middle), and all modalities (bottom) on the
WVU dataset.

In the proposed dictionary learning framework, which
utilizes the stochastic gradient algorithm, the learning rate
should be carefully chosen for convergence of the algorithm.
In our experiments, a heuristic was used to control
the learning rate. Topics of future research include developing
better optimization tools with fast convergence guarantees
in this non-convex setting. Moreover, developing task-driven
dictionary learning algorithms under other structured
sparsity priors proposed for multimodal fusion, such as the
tree-structured sparsity prior [45], [68], is another future
research topic. Future research will also include adapting
the proposed algorithms to other multimodal tasks such as
multimodal retrieval, multimodal action recognition using
Kinect data, and image super-resolution.

Fig. 8: Comparison of the reconstructive-based (JSRC
and JSRC-UDL) and the proposed discriminative-based
(SMDL$_{\ell_{12}}$) classification algorithms obtained using the
joint sparsity prior for different numbers of dictionary
atoms per class on the WVU dataset. (Panels: Irises,
Fingerprints, All Modalities.)


Appendix 


The  proof  of  Proposition  3.1  is  presented  using  the 
following  two  results. 

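Lemma A.1 (Optimality condition): The matrix
$A^\star = [\alpha^{1\star} \dots \alpha^{S\star}] = [a_{1\to}^{\star T} \dots a_{d\to}^{\star T}]^T \in \mathbb{R}^{d \times S}$
is a minimizer of (7) if and only if, $\forall j \in \{1, \dots, d\}$,

$$
\begin{cases}
\left[ d_j^{1T}(x^1 - D^1\alpha^{1\star}) \;\dots\; d_j^{ST}(x^S - D^S\alpha^{S\star}) \right] - \lambda_2 a_{j\to}^\star = \lambda_1 \dfrac{a_{j\to}^\star}{\|a_{j\to}^\star\|_{\ell_2}}, & \text{if } \|a_{j\to}^\star\|_{\ell_2} \neq 0, \\[2ex]
\left\| \left[ d_j^{1T}(x^1 - D^1\alpha^{1\star}) \;\dots\; d_j^{ST}(x^S - D^S\alpha^{S\star}) \right] - \lambda_2 a_{j\to}^\star \right\|_{\ell_2} \le \lambda_1, & \text{otherwise.}
\end{cases}
\tag{24}
$$

Proof: The proof follows directly from the subgradient
optimality condition of (7), i.e.,

$$
0 \in \left\{ \left[ D^{1T}(D^1\alpha^{1\star} - x^1) \;\dots\; D^{ST}(D^S\alpha^{S\star} - x^S) \right] + \lambda_2 A^\star + \lambda_1 P \;:\; P \in \partial\|A^\star\|_{\ell_{12}} \right\},
$$

where $\partial\|A^\star\|_{\ell_{12}}$ denotes the subgradient of the $\ell_{12}$ norm
evaluated at $A^\star$. As shown in [69], the subgradient is
characterized, for all $j \in \{1, \dots, d\}$, as
$p_{j\to} = a_{j\to}^\star / \|a_{j\to}^\star\|_{\ell_2}$ if $\|a_{j\to}^\star\|_{\ell_2} > 0$,
and $\|p_{j\to}\|_{\ell_2} \le 1$ otherwise. ■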
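
The optimality condition (24) can be checked numerically. The sketch below assumes that problem (7) has the form $\sum_s \frac{1}{2}\|x^s - D^s\alpha^s\|_{\ell_2}^2 + \lambda_1\|A\|_{\ell_{12}} + \frac{\lambda_2}{2}\|A\|_F^2$, which is the form implied by (24) and (25), and solves it with a plain proximal-gradient iteration. That solver is a stand-in chosen for brevity, not necessarily the one used in the paper, and the function names and random data are hypothetical.

```python
import numpy as np

def correlations(A, xs, Ds):
    """Column s holds D^s^T (x^s - D^s alpha^s); row j is the bracket in (24)."""
    return np.column_stack([D.T @ (x - D @ A[:, s])
                            for s, (x, D) in enumerate(zip(xs, Ds))])

def joint_sparse_code(xs, Ds, lam1, lam2, n_iter=5000):
    """Proximal-gradient solver for the assumed form of problem (7)."""
    A = np.zeros((Ds[0].shape[1], len(Ds)))
    # Step size 1/L, with L the Lipschitz constant of the smooth part.
    step = 1.0 / (max(np.linalg.norm(D, 2) ** 2 for D in Ds) + lam2)
    for _ in range(n_iter):
        # Gradient step on the data-fit and Frobenius terms.
        B = A + step * (correlations(A, xs, Ds) - lam2 * A)
        # Row-wise soft-thresholding: proximal operator of the l12 norm.
        norms = np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1e-15)
        A = np.maximum(0.0, 1.0 - step * lam1 / norms) * B
    return A

def optimality_gaps(A, xs, Ds, lam1, lam2, tol=1e-10):
    """Largest violation of (24) on the active rows and on the zero rows."""
    C = correlations(A, xs, Ds) - lam2 * A
    norms = np.linalg.norm(A, axis=1)
    active = norms > tol
    target = lam1 * A[active] / norms[active][:, None]
    gap_active = np.abs(C[active] - target).max(initial=0.0)
    excess = np.linalg.norm(C[~active], axis=1) - lam1
    gap_zero = np.maximum(excess, 0.0).max(initial=0.0)
    return gap_active, gap_zero

# Random stand-in data: S = 2 modalities, unit-norm atoms and inputs.
rng = np.random.default_rng(0)
S, n, d = 2, 20, 10
Ds = [rng.standard_normal((n, d)) for _ in range(S)]
Ds = [D / np.linalg.norm(D, axis=0) for D in Ds]
xs = [rng.standard_normal(n) for _ in range(S)]
xs = [x / np.linalg.norm(x) for x in xs]
lam1, lam2 = 0.1, 0.05

A_star = joint_sparse_code(xs, Ds, lam1, lam2)
gap_active, gap_zero = optimality_gaps(A_star, xs, Ds, lam1, lam2)
```

Both gaps should be numerically zero at a converged solution: the equality in (24) holds on rows with nonzero norm and the inequality holds on the remaining rows.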
Before proceeding to the next proposition, we need to
define the term transition point. For a given $\{x^s\}$, let $\Lambda_\lambda$
be the active set of the solution $A^\star$ of (7) when $\lambda_1 = \lambda$.
Then $\lambda$ is defined to be a transition point of $\{x^s\}$ if
$\Lambda_{\lambda+\epsilon} \neq \Lambda_{\lambda-\epsilon}$, $\forall \epsilon > 0$.

Proposition A.1 (Regularity of $A^\star$): Let $\lambda_2 > 0$ and let
assumption (A) hold. Then,

Part 1. $A^\star(\{x^s, D^s\})$ is a continuous function of $\{x^s\}$
and $\{D^s\}$.

Part 2. If $\lambda_1$ is not a transition point of $\{x^s\}$, then
the active set $\Lambda$ of $A^\star(\{x^s, D^s\})$ is locally constant with
respect to both $\{x^s\}$ and $\{D^s\}$. Moreover, $A^\star(\{x^s, D^s\})$
is locally differentiable with respect to $\{D^s\}$.

Part 3. $\forall \lambda_1 > 0$, $\exists$ a set $\mathcal{N}_{\lambda_1}$ of measure zero such that,
$\forall \{x^s\} \in \{\mathbb{R}^{n^s}\} \setminus \mathcal{N}_{\lambda_1}$, $\lambda_1$ is not a transition point
of $\{x^s\}$.

Proof: Part 1. In the special case of $S = 1$, which
is equivalent to an elastic net problem, this has already
been shown [19], [70]. Our proof follows similar steps.
Assumption (A) guarantees that $A^\star$ is bounded. Therefore,
we can restrict the optimization problem (7) to a compact
subset of $\mathbb{R}^{d \times S}$. Since $A^\star$ is unique (imposed by $\lambda_2 > 0$),
the cost function of (7) is continuous in $A$, and each
element of the set $\{x^s, D^s\}$ is defined over a compact set,
$A^\star(\{x^s, D^s\})$ is a continuous function of $\{x^s\}$ and $\{D^s\}$.

Part 2 and Part 3. These statements are proved here by
converting the optimization problem (7) into an equivalent
group lasso problem [71] and using some recent results
on it. Let the matrix $D'_j = \mathrm{blkdiag}(d_j^1, \dots, d_j^S) \in \mathbb{R}^{n \times S}$,
$\forall j \in \{1, \dots, d\}$, be the block-diagonal collection
of the $j$th atoms of the dictionaries. Also let
$D' = [D'_1 \dots D'_d] \in \mathbb{R}^{n \times Sd}$, $x' = [x^{1T} \dots x^{ST}]^T \in \mathbb{R}^n$, and
$a' = [a_{1\to} \dots a_{d\to}]^T \in \mathbb{R}^{Sd}$. Then (7) can be rewritten as

$$
\min_{A} \; \frac{1}{2}\|x' - D'a'\|_{\ell_2}^2 + \lambda_1 \sum_{j=1}^{d} \|a_{j\to}\|_{\ell_2} + \frac{\lambda_2}{2}\|A\|_F^2.
\tag{25}
$$

This can be further converted into the standard group lasso:

$$
\min_{A} \; \frac{1}{2}\|x'' - D''a'\|_{\ell_2}^2 + \lambda_1 \sum_{j=1}^{d} \|a_{j\to}\|_{\ell_2},
\tag{26}
$$

where $x'' = [x'^T \; 0^T]^T \in \mathbb{R}^{n+Sd}$ and
$D'' = [D'^T \; \sqrt{\lambda_2}\, I]^T \in \mathbb{R}^{(n+Sd) \times Sd}$. It is clear that the matrix
$D''$ is full column rank. The rest of the proof follows
directly from the results in [72]. ■
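
The conversion from (7) to the group lasso (26) can be verified numerically. The sketch below builds $D''$ and $x''$ as defined above and compares the two objective values at a random $A$, again assuming the form of (7) implied by (25). The function names and the random data are hypothetical.

```python
import numpy as np

def stacked_group_lasso(xs, Ds, lam2):
    """Build D'' and x'' of (26) from the per-modality inputs and dictionaries."""
    S, d = len(Ds), Ds[0].shape[1]
    offsets = np.concatenate(([0], np.cumsum([D.shape[0] for D in Ds])))
    D1 = np.zeros((offsets[-1], S * d))              # D' in R^{n x Sd}
    for j in range(d):
        for s in range(S):
            # Column j*S + s holds atom j of modality s in its own row block,
            # so columns j*S .. j*S + S - 1 form blkdiag(d_j^1, ..., d_j^S).
            D1[offsets[s]:offsets[s + 1], j * S + s] = Ds[s][:, j]
    x1 = np.concatenate(xs)                          # x'
    D2 = np.vstack([D1, np.sqrt(lam2) * np.eye(S * d)])
    x2 = np.concatenate([x1, np.zeros(S * d)])
    return D2, x2

def objective_7(A, xs, Ds, lam1, lam2):
    """Assumed form of (7): data fit + l12 penalty + Frobenius penalty."""
    fit = sum(0.5 * np.sum((x - D @ A[:, s]) ** 2)
              for s, (x, D) in enumerate(zip(xs, Ds)))
    return fit + lam1 * np.linalg.norm(A, axis=1).sum() + 0.5 * lam2 * np.sum(A ** 2)

def objective_26(A, D2, x2, lam1):
    """Group-lasso objective (26), with a' the concatenated rows of A."""
    a = A.reshape(-1)
    return 0.5 * np.sum((x2 - D2 @ a) ** 2) + lam1 * np.linalg.norm(A, axis=1).sum()

# Two modalities with different feature dimensions and d = 5 atoms each.
rng = np.random.default_rng(1)
dims, d = (6, 9), 5
Ds = [rng.standard_normal((n_s, d)) for n_s in dims]
xs = [rng.standard_normal(n_s) for n_s in dims]
lam1, lam2 = 0.2, 0.3
A = rng.standard_normal((d, len(dims)))

D2, x2 = stacked_group_lasso(xs, Ds, lam2)
value_7 = objective_7(A, xs, Ds, lam1, lam2)
value_26 = objective_26(A, D2, x2, lam1)
rank_D2 = np.linalg.matrix_rank(D2)
```

The two objective values agree for any $A$, and `rank_D2` equals $Sd$ because the appended block $\sqrt{\lambda_2}\, I$ makes $D''$ full column rank whenever $\lambda_2 > 0$.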

Proof of Proposition 3.1: The above proposition
implies that $A^\star$ is differentiable almost everywhere. We
now prove Proposition 3.1. It is easy to show that $f$
is differentiable with respect to $w^s$ due to assumption
(A) and the fact that $l_{su}$ is twice differentiable. $f$ is
also differentiable with respect to $D^s$ given assumption
(A), twice differentiability of $l_{su}$, and the fact that $A^\star$
is differentiable everywhere except on a set of measure
zero (Prop. A.1). We obtain the derivative of $f$ with respect
to $D^s$ using the chain rule. The steps are similar to those
taken for $\ell_1$-related optimization in [36], though a bit more
involved. Since the active set is locally constant, using the
optimality condition (24), we can implicitly differentiate
$A^\star(\{x^s, D^s\})$ with respect to $D^s$. For the non-active rows
of $A^\star$, the differential is zero. On the active set $\Lambda$, (24) can
be rewritten as

$$
\left[ D_\Lambda^{1T}(x^1 - D_\Lambda^1\alpha_\Lambda^{1\star}) \;\dots\; D_\Lambda^{ST}(x^S - D_\Lambda^S\alpha_\Lambda^{S\star}) \right] - \lambda_2 A_{\Lambda\to}^\star
= \lambda_1 \left[ \frac{a_{1\to}^{\star T}}{\|a_{1\to}^\star\|_{\ell_2}} \;\dots\; \frac{a_{N\to}^{\star T}}{\|a_{N\to}^\star\|_{\ell_2}} \right]^T,
\tag{27}
$$

where $N$ is the cardinality of $\Lambda$, the rows $a_{1\to}^\star, \dots, a_{N\to}^\star$ are the
active rows of $A^\star$, and $D_\Lambda^s$ and $A_{\Lambda\to}^\star$ are the
matrices consisting of the active columns of $D^s$ and the active rows
of $A^\star$, respectively. For the rest of the proof, we only work
on the active set, and the symbols $\Lambda$ and $\star$ are dropped for
ease of notation. Taking the partial derivative of both
sides of (27) with respect to $d_{ij}^s$, the element in the $i$th row
and $j$th column of $D^s$, and taking its transpose, we have:


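$$
\begin{bmatrix} 0 \\ \vdots \\ (D^s\alpha^s - x^s)^T E_{ij}^s + \alpha^{sT} E_{ij}^{sT} D^s \\ \vdots \\ 0 \end{bmatrix}
= - \begin{bmatrix} \dfrac{\partial \alpha^{1T}}{\partial d_{ij}^s} D^{1T} D^1 \\ \vdots \\ \dfrac{\partial \alpha^{ST}}{\partial d_{ij}^s} D^{ST} D^S \end{bmatrix}
- \lambda_2 \frac{\partial A^T}{\partial d_{ij}^s}
- \lambda_1 \left[ \Delta_1 \frac{\partial a_{1\to}^T}{\partial d_{ij}^s} \;\dots\; \Delta_N \frac{\partial a_{N\to}^T}{\partial d_{ij}^s} \right],
$$

where $E_{ij}^s \in \mathbb{R}^{n^s \times N}$ is a matrix with zero elements except
the element in the $i$th row and $j$th column, which is one,
and

$$
\Delta_k = \frac{1}{\|a_{k\to}\|_{\ell_2}} I - \frac{1}{\|a_{k\to}\|_{\ell_2}^3}\, a_{k\to}^T a_{k\to},
\quad \forall k \in \{1, \dots, N\}.
$$

It is easy to check that $\Delta_k \succeq 0$.
Vectorizing both sides and factorizing results in

$$
\mathrm{vec}\left( \frac{\partial A^T}{\partial d_{ij}^s} \right)
= P_{\tilde{s}} \begin{bmatrix} (x^s - D^s\alpha^s)^T e_{ij,1}^s - \alpha^{sT} E_{ij}^{sT} d_1^s \\ \vdots \\ (x^s - D^s\alpha^s)^T e_{ij,N}^s - \alpha^{sT} E_{ij}^{sT} d_N^s \end{bmatrix},
\tag{28}
$$

where $e_{ij,k}^s$ is the $k$th column of $E_{ij}^s$,
$P = \left( \hat{D}^T \hat{D} + \lambda_1 \hat{\Delta} + \lambda_2 I \right)^{-1}$, and $\hat{D}$ and $\hat{\Delta}$ are defined
in Eqs. (19) and (20), respectively. Further simplifying
Eq. (28) yields

$$
\mathrm{vec}\left( \frac{\partial A^T}{\partial d_{ij}^s} \right)
= P_{\tilde{s}} E_{ij}^{sT} (x^s - D^s\alpha^s) - P_{\tilde{s}} D^{sT} E_{ij}^s \alpha^s,
$$

where $\tilde{s}$ is defined in Eq. (21). Using the chain rule, we
have

$$
\frac{\partial f}{\partial d_{ij}^s} = \mathbb{E}\left[ g^T \mathrm{vec}\left( \frac{\partial A^T}{\partial d_{ij}^s} \right) \right],
$$

where $g = \mathrm{vec}\left( \frac{\partial}{\partial A^T} \sum_{s=1}^{S} l_{su} \right)$. Therefore, the derivative with
respect to the active columns of the dictionary $D^s$ is

$$
\frac{\partial f}{\partial D^s}
= \mathbb{E} \begin{bmatrix}
g^T P_{\tilde{s}} \left( E_{11}^{sT}(x^s - D^s\alpha^s) - D^{sT} E_{11}^s \alpha^s \right) & \dots & g^T P_{\tilde{s}} \left( E_{1N}^{sT}(x^s - D^s\alpha^s) - D^{sT} E_{1N}^s \alpha^s \right) \\
\vdots & \ddots & \vdots \\
g^T P_{\tilde{s}} \left( E_{n^s 1}^{sT}(x^s - D^s\alpha^s) - D^{sT} E_{n^s 1}^s \alpha^s \right) & \dots & g^T P_{\tilde{s}} \left( E_{n^s N}^{sT}(x^s - D^s\alpha^s) - D^{sT} E_{n^s N}^s \alpha^s \right)
\end{bmatrix}
= \mathbb{E}\left[ (x^s - D^s\alpha^s)\, g^T P_{\tilde{s}} - D^s P_{\tilde{s}}^T g\, \alpha^{sT} \right].
$$

Setting $\beta = P^T g \in \mathbb{R}^{NS}$ and noting that $\beta_{\tilde{s}} = P_{\tilde{s}}^T g$
completes the proof. ■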
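
The final gradient expression can be sanity-checked against finite differences. The sketch below makes several assumptions that are not in the paper: a single sample (so the expectation is dropped), a hypothetical quadratic loss $l(A) = \frac{1}{2}\|A - T\|_F^2$ standing in for $l_{su}$, the form of (7) implied by (25), a plain proximal-gradient solver, and $\tilde{s}$ taken as the entries of $\mathrm{vec}(A^T)$ that belong to modality $s$. All function names are illustrative.

```python
import numpy as np

def solve_joint(xs, Ds, lam1, lam2, n_iter=4000):
    """Proximal-gradient solver for the assumed form of problem (7)."""
    A = np.zeros((Ds[0].shape[1], len(Ds)))
    step = 1.0 / (max(np.linalg.norm(D, 2) ** 2 for D in Ds) + lam2)
    for _ in range(n_iter):
        C = np.column_stack([D.T @ (x - D @ A[:, s])
                             for s, (x, D) in enumerate(zip(xs, Ds))])
        B = A + step * (C - lam2 * A)
        norms = np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1e-15)
        A = np.maximum(0.0, 1.0 - step * lam1 / norms) * B
    return A

def dictionary_gradient(A, xs, Ds, G_A, lam1, lam2, s):
    """Gradient of l(A*) with respect to D^s, given G_A = dl/dA (d x S)."""
    S = A.shape[1]
    act = np.flatnonzero(np.linalg.norm(A, axis=1) > 1e-10)   # active set
    N, Aa = act.size, A[act]
    H = lam2 * np.eye(N * S)          # will hold D^T D + lam1*Delta + lam2*I
    for t in range(S):
        Dt = Ds[t][:, act]
        idx = np.arange(N) * S + t    # entries of vec(A^T) for modality t
        H[np.ix_(idx, idx)] += Dt.T @ Dt
    for k in range(N):
        a, r = Aa[k], np.linalg.norm(Aa[k])
        H[k * S:(k + 1) * S, k * S:(k + 1) * S] += lam1 * (
            np.eye(S) / r - np.outer(a, a) / r ** 3)          # Delta_k
    beta = np.linalg.solve(H, G_A[act].reshape(-1))           # beta = P g
    b = beta[np.arange(N) * S + s]                            # beta restricted to modality s
    resid = xs[s] - Ds[s][:, act] @ Aa[:, s]
    grad = np.zeros_like(Ds[s])       # non-active columns keep a zero gradient
    grad[:, act] = np.outer(resid, b) - np.outer(Ds[s][:, act] @ b, Aa[:, s])
    return grad

rng = np.random.default_rng(0)
S, n, d, s = 2, 15, 8, 0
Ds = [rng.standard_normal((n, d)) for _ in range(S)]
Ds = [D / np.linalg.norm(D, axis=0) for D in Ds]
xs = [rng.standard_normal(n) for _ in range(S)]
xs = [x / np.linalg.norm(x) for x in xs]
T = rng.standard_normal((d, S))       # hypothetical target for the stand-in loss
lam1, lam2 = 0.05, 0.1

def loss(dicts):
    return 0.5 * np.sum((solve_joint(xs, dicts, lam1, lam2) - T) ** 2)

def finite_difference(i, j, h=1e-6):
    """Central difference of the loss with respect to entry (i, j) of D^s."""
    Dp, Dm = [D.copy() for D in Ds], [D.copy() for D in Ds]
    Dp[s][i, j] += h
    Dm[s][i, j] -= h
    return (loss(Dp) - loss(Dm)) / (2 * h)

A_star = solve_joint(xs, Ds, lam1, lam2)
grad = dictionary_gradient(A_star, xs, Ds, A_star - T, lam1, lam2, s)
```

Comparing entries of `grad` on active columns with `finite_difference(i, j)` gives a direct check of the sign conventions in the derivation; the comparison is meaningful only away from transition points, where the active set is locally constant.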

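Derivation of the algorithm with the mixed $\ell_{12}$-$\ell_{11}$
prior can be obtained similarly. For each active row $j \in \Lambda$
of $A^\star$, the solution of the optimization problem (23) with
the mixed prior, let $\Upsilon_j \subseteq \{1, \dots, S\}$ be the set of active modalities,
i.e., those with nonzero entries. Then the optimality condition
for the active row $j$ is

$$
\left[ d_j^{1T}(x^1 - D^1\alpha^{1\star}) \;\dots\; d_j^{ST}(x^S - D^S\alpha^{S\star}) \right]_{\Upsilon_j}
- \lambda_2 a_{j\to,\Upsilon_j}^\star
= \lambda_1 \frac{a_{j\to,\Upsilon_j}^\star}{\|a_{j\to,\Upsilon_j}^\star\|_{\ell_2}}
+ \lambda_1' \, \mathrm{sign}\left( a_{j\to,\Upsilon_j}^\star \right).
$$

Then, the algorithm for the mixed prior can be obtained by
differentiating the optimality condition, following steps
similar to those shown for the $\ell_{12}$ prior.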
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